Swine flu epidemic in Finland 2009-2011: Data

In the spring of 2009, a new influenza strain, A(H1N1)pdm09, a.k.a. “swine flu”, appeared and caused a worldwide pandemic. Luckily, swine flu turned out to be a very mild infection, and the harm was insignificant. I have studied the Finnish branch of the global pandemic since 2010: I wrote my Master’s thesis about it, published a paper, and am going to submit another paper soon. This is a story about the data visualization in my upcoming paper. The next post will be devoted to the visualization of the results.

I will start with the data visualization I made 4 years ago for my Master’s thesis:

[Figure from the thesis: panels A–E]

(A) The number of new cases per week. The horizontal axis shows the week, from week 18 of 2009 to week 5 of 2010. The data consist of several layers: cases identified as influenza A, cases specifically identified as influenza A(H1N1), and cases identified as influenza A(H1N1) and admitted to hospital. (B) The total number of cases by age group. (C) The number of cases per 10,000 individuals by age group. (D) The total number of cases by region. (E) The number of cases per 10,000 individuals by region.

I still think it was a nice visualization (I especially love the idea of showing absolute values with bars and relative values with ticks), but now I can see a lot of mistakes in it.
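
For readers who want to play with the bars-plus-ticks idea, here is a minimal matplotlib sketch. The age groups, case counts and population sizes are made up for illustration; the real figures are in the thesis.

```python
import numpy as np
import matplotlib.pyplot as plt

# Hypothetical age groups, case counts and population sizes (not the real data).
age_groups = ["0-9", "10-19", "20-29", "30-39", "40+"]
cases = np.array([1200, 1500, 900, 600, 400])              # absolute counts
population = np.array([580e3, 600e3, 650e3, 700e3, 2.4e6])
per_10k = cases / population * 1e4                         # relative incidence

x = np.arange(len(age_groups))
fig, ax_abs = plt.subplots()
ax_rel = ax_abs.twinx()                                    # second axis for relative values

# Absolute values as bars on the left axis.
ax_abs.bar(x, cases, color="lightgrey")
ax_abs.set_xticks(x)
ax_abs.set_xticklabels(age_groups)
ax_abs.set_ylabel("total cases")

# Relative values as short horizontal ticks on the right axis.
ax_rel.plot(x, per_10k, linestyle="none", marker="_",
            markersize=25, markeredgewidth=3, color="black")
ax_rel.set_ylabel("cases per 10,000 individuals")

plt.show()
```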


One of the worst infographics ever

Serif or sans?

There is a long story behind the choice of font family. Here is an infographic introducing the discussion. It seems very nice until you realize that all the statements in that infographic have actually been scientifically disproven.


Making a Visualization Better

Here is a piece of my work. I was writing a methods paper (currently awaiting publication) with my collaborators. We needed figures illustrating how the method works. Some visualizations had already been made. I tried to improve on them, and here is what I did.

The paper concerns BIOLOG phenotype microarrays: lab equipment that measures the metabolism of bacteria over time. So basically we need to draw a bunch of time series.
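
As a rough illustration of the underlying plotting task (not the actual method from the paper), here is a small matplotlib sketch that draws a bunch of synthetic growth-like curves, one per well; all numbers are invented.

```python
import numpy as np
import matplotlib.pyplot as plt

# Synthetic stand-in for phenotype microarray data: one metabolic-activity
# curve per well, measured over 48 hours (all numbers invented).
rng = np.random.default_rng(0)
hours = np.linspace(0, 48, 97)          # a reading every 30 minutes
n_wells = 12

fig, ax = plt.subplots()
for well in range(n_wells):
    # Logistic-shaped curve with a random rate and plateau, plus noise.
    rate = rng.uniform(0.1, 0.4)
    plateau = rng.uniform(100, 300)
    signal = plateau / (1 + np.exp(-rate * (hours - 20)))
    signal += rng.normal(0, 5, size=hours.shape)
    ax.plot(hours, signal, alpha=0.7, linewidth=1)

ax.set_xlabel("time (hours)")
ax.set_ylabel("metabolic activity (arbitrary units)")
plt.show()
```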

Original version: [figure]


Reinventing the histogram. New visualization tools for representing 1-dimensional posterior distributions

In applied Bayesian statistics, the results of a study typically consist of a posterior distribution over the space of parameters. These posteriors are usually smooth continuous functions, approximated numerically with a set of samples obtained with methods like MCMC, importance sampling (IS), or ABC.

Good visualization of such an approximation is problematic, even if the posterior has only one dimension. On the one hand, a simple stack of all the samples gives a noisy, saw-like picture, very different from the function we are trying to approximate. On the other hand, by smoothing this picture we amputate a part of the results. The choice of an appropriate visualization tool is therefore a compromise between readability and information content. But should it be like this? Is there a visualization method which could aim for both?


Let’s show both the raw data and the difference between the raw and the smoothed data!
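
To make the idea concrete, here is a minimal sketch of one possible reading of it (not necessarily the exact tool from this post): a fine-binned histogram of hypothetical posterior samples, a kernel density estimate on top, and a second panel showing the raw-minus-smoothed difference.

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import gaussian_kde

# Hypothetical 1-dimensional posterior samples (e.g. MCMC output).
rng = np.random.default_rng(1)
samples = np.concatenate([rng.normal(0, 1, 4000), rng.normal(3, 0.5, 1000)])

# Raw view: a fine-binned, density-normalized histogram.
heights, edges = np.histogram(samples, bins=80, density=True)
centers = 0.5 * (edges[:-1] + edges[1:])

# Smoothed view: a kernel density estimate evaluated at the bin centers.
smooth = gaussian_kde(samples)(centers)

fig, (ax_top, ax_bottom) = plt.subplots(2, 1, sharex=True)
ax_top.bar(centers, heights, width=np.diff(edges), color="lightgrey")
ax_top.plot(centers, smooth, color="black")
ax_top.set_ylabel("density")

# The difference panel keeps the sampling noise visible without letting it
# obscure the smoothed shape above.
ax_bottom.bar(centers, heights - smooth, width=np.diff(edges), color="steelblue")
ax_bottom.axhline(0, color="black", linewidth=0.5)
ax_bottom.set_ylabel("raw - smoothed")
plt.show()
```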


Principles of posterior visualization

In Bayesian statistics, results appear in the form of a posterior distribution: a measure of uncertainty quantified in terms of probability. Bayesian statistics is a mature field. However, visualization of posterior distributions has not been understood as a distinct problem. Inappropriate visualization methods developed for other types of problems are widely used. Methods for visualizing posteriors over spaces of complex objects (e.g. graphs, phylogenetic trees, clusterings, alignments, covariance matrices, etc.) are immature. Here I try to establish a few principles which can be used to filter out improper visualization techniques and develop correct ones.

Of course, to judge whether a certain figure is better than another we need to understand the context. Here I am focusing on visualization intended for communication. That means we already have a posterior distribution, and we want to present it (or some of its features) in a way which is honest, easy to understand, and hard to misinterpret.

Principle 1: Uncertainty should be visualized

In Bayesian statistics, uncertainty is an essential part of the results. Not visualizing it is as wrong as concealing half of the result.

Principle 2: Visualization of variability ≠ Visualization of uncertainty

The boxplot is a striking example for this principle. The boxplot is a perfect tool for showing variability in the data, but it should not be used for visualizing a posterior distribution. The inner interval of the boxplot contains almost the same probability mass as the outer intervals, but it is presented in a completely different way. This deceives the reader, leaving an overconfident impression about the estimates.
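
A small sketch of the problem, using made-up posterior samples: a boxplot and a histogram of the same samples, one above the other. With the whiskers drawn to the full range, each half of the box and each whisker segment holds roughly a quarter of the samples, yet visually the box dominates.

```python
import numpy as np
import matplotlib.pyplot as plt

# Made-up, skewed posterior samples.
rng = np.random.default_rng(2)
samples = rng.gamma(shape=2.0, scale=1.0, size=5000)

fig, (ax_box, ax_hist) = plt.subplots(2, 1, sharex=True)

# Boxplot with whiskers drawn to the full range: each half of the box and
# each whisker segment holds roughly a quarter of the samples, yet the box
# is drawn far more prominently than the whiskers.
ax_box.boxplot(samples, vert=False, whis=(0, 100))
ax_box.set_yticks([])
ax_box.set_title("Boxplot of posterior samples")

# The same samples shown directly.
ax_hist.hist(samples, bins=60, density=True, color="lightgrey")
ax_hist.set_title("Histogram of the same samples")
ax_hist.set_xlabel("parameter value")
plt.show()
```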



The Single Axiom of Visualization

Visualization (as well as art or design) is somehow considered to be a subjective business. One likes blue, another likes yellow. One likes scatter plots, another likes pie charts. Any discussion about visualization is a discussion about personal preferences, and therefore meaningless.

Is it true? Or is it possible to formulate an objective criterion for good visualization? I believe there is a single criterion which is simple and general enough. I call it the single axiom of visualization:

Visualization is here to do certain stuff

The point here is: any evaluation of any figure must be based on two fundamental questions: what is this figure trying to do, and is it efficient in doing so? Any other criterion (such as Tufte’s data/ink ratio) should be applied keeping these two questions in mind.

In science, visualization is used in two main situations:

1. Exploratory analysis. Visualization is made to reveal hidden trends, patterns and oddities. In this situation, one usually doesn’t know what he/she is searching for. Figures are meant for the researcher himself/herself (or his/her research group). One can create as many figures as he/she wants, and as complex as he/she requires.

2. Illustration in a scientific publication. Visualization is made to present, explain or prove the findings of the researcher (i.e. for the same reason as the text in the publication). One always knows what point he/she wants to demonstrate with a particular figure. Figures will be read by a wide but educated audience. Space in the publication itself is usually limited, so only a few figures can be used, though sometimes supplementary materials can be exploited. When preparing figures for publication, one should keep in mind that the publication could be printed out in black and white.

Depending on the situation, the goals of visualization are different, so the principles of visualization should be different too. For example, one should not present information using the contrast between red and green, as colorblind readers will be unable to see it. But why should one follow this rule while doing exploratory visualization, knowing that he/she is not colorblind?
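
As a small illustration of the publication case (the data is invented; the two hex colors are from the Okabe-Ito colorblind-friendly palette), one can distinguish series by linestyle and marker as well as color, so the figure survives both grayscale printing and red-green colorblindness:

```python
import numpy as np
import matplotlib.pyplot as plt

# Invented data; colors taken from the Okabe-Ito colorblind-friendly palette.
x = np.linspace(0, 10, 50)

fig, ax = plt.subplots()
# Distinguish the series by linestyle and marker as well as color, so the
# figure still reads in grayscale print and for colorblind readers.
ax.plot(x, np.sin(x), color="#0072B2", linestyle="-", marker="o",
        markevery=5, label="method A")
ax.plot(x, np.cos(x), color="#D55E00", linestyle="--", marker="s",
        markevery=5, label="method B")
ax.legend()
ax.set_xlabel("x")
ax.set_ylabel("response")
plt.show()
```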

PS: Andrew Gelman discusses these issues a lot, for example here and here.
