In spring 2009, a new influenza strain, A(H1N1)pdm09, aka “swine flu”, appeared and caused a worldwide pandemic. Luckily, swine flu turned out to be a very mild infection, and the harm was insignificant. I have studied the Finnish branch of the global pandemic since 2010. I wrote my Master’s thesis about it, published a paper, and am going to submit another paper soon. This is a story about the data visualization in my upcoming paper. The next post will be devoted to the visualization of the results.
I will start with the data visualization I made 4 years ago for my Master’s thesis:
(A) The number of new cases per week. The horizontal axis shows the week, from week 18 of 2009 to week 5 of 2010. The data consist of several layers: cases identified with A influenza; cases specifically identified with A(H1N1) influenza; cases identified with A(H1N1) influenza and admitted to hospital. (B) The total number of cases by age group. (C) The number of cases per 10,000 individuals by age group. (D) The total number of cases by region. (E) The number of cases per 10,000 individuals by region.
I still think it was a nice visualization (I especially love the idea of showing absolute values with bars and relative values with ticks), but now I can see a lot of mistakes.
1) I was making this visualization for myself, not for the reader. Almost every value is labelled with its number. That helped me analyze the data, but it is nothing but visual noise for the average reader. Now I would remove the numbers from the picture and put them in a separate table. Figures should come with an axis and a grid.
2) The time axis on panel (A) needs to be labelled with day-month-year, not with week numbers.
3) Wrong order of colors: red and green should not be placed next to each other. Green reads really badly on red.
4) Panels (B)-(E) should be rotated so that the labels are horizontal.
5) Regions on panels (D) and (E) should be ordered according to their size or their attack rate. Now they just appear in the order I got them.
6) Panel labels are too bold.
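Points 2 and 5 are easy to fix before any plotting happens. Here is a minimal sketch in Python of how that could look. It is not the code I used for the figure: the function names and the region data are made up for illustration, and I am assuming that the week numbers are ISO weeks, which may not match the surveillance data exactly.

```python
from datetime import date


def week_label(year, week):
    """Return the Monday of an ISO week as a day-month-year string.

    Assumption: the week numbers follow the ISO 8601 calendar.
    """
    return date.fromisocalendar(year, week, 1).strftime("%d-%m-%Y")


def order_by_attack_rate(regions, per=10000):
    """Turn (name, cases, population) tuples into (name, rate) pairs,
    where rate is cases per `per` individuals, sorted highest first."""
    rated = [(name, per * cases / population)
             for name, cases, population in regions]
    return sorted(rated, key=lambda item: item[1], reverse=True)


# Hypothetical regions, NOT the real Finnish data
regions = [
    ("Region A", 120, 400000),
    ("Region B", 90, 150000),
    ("Region C", 30, 200000),
]

print(week_label(2009, 18))  # 27-04-2009
print(week_label(2010, 5))   # 01-02-2010
print(order_by_attack_rate(regions))
# [('Region B', 6.0), ('Region A', 3.0), ('Region C', 1.5)]
```

Ordering by size instead would only require sorting on the population column.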
Now let’s move on to the upcoming paper. I’m skipping all the sketches that were not meant to go into print. Here is the first publishable version:
Panels A and B: the numbers of ascertained cases per week on the absolute and log scales. Panel C: the numbers of vaccinated individuals per week. The shaded areas mark the first and the second epidemic seasons. Panel D: the numbers of ascertained cases per age group in the first and second seasons. Panel E: population sizes and the numbers of vaccinated individuals per age group.
There are some changes in the dataset. The studied period is twice as long and includes the second epidemic season. Cases are grouped differently: A influenza and A(H1N1) influenza cases are merged into one group (mild), and intensive care cases are separated from the hospitalized ones. Information on vaccination is included.
I use both absolute and log scales (panels A and B) because the absolute scale hides the details (the red line is much smaller than the green line), while the log scale is hard to read on its own. The time axis is labelled with the real dates corresponding to important events. The epidemic seasons (the most important periods) are highlighted.
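To put rough numbers on why the absolute scale hides the smaller series, here is a small Python sketch. The counts are hypothetical (a peak of 1000 for the large series and 10 for the small one), not values from my data, and the two helper functions are names I made up. They compute how high a point sits on an axis, as a fraction of the axis height.

```python
import math


def linear_height(value, axis_max):
    """Fraction of the axis height reached on a linear axis starting at 0."""
    return value / axis_max


def log_height(value, axis_min, axis_max):
    """Fraction of the axis height reached on a log10 axis.

    All three arguments must be positive: zero has no place on a log axis.
    """
    span = math.log10(axis_max) - math.log10(axis_min)
    return (math.log10(value) - math.log10(axis_min)) / span


# Hypothetical peak weekly counts, for illustration only
large_peak = 1000
small_peak = 10

# Linear axis from 0 to 1000: the small series reaches 1% of the height
print(linear_height(small_peak, large_peak))

# Log axis from 1 to 1000: the same point sits a third of the way up
print(log_height(small_peak, 1, large_peak))
```

On the linear axis the small series is squeezed into the bottom one percent of the panel, while on the log axis it gets a third of the height. The price is that equal distances on the log axis no longer mean equal differences in counts, which is why I kept both panels.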
This picture looked very nice on my screen, but when I printed it out I realized that the grid lines and shading were invisible. I made a darker version:
Then I started to think about panel D. It shows the data by age group. I had to use the log scale here, because the numbers of cases are negligible compared to the population size. But the log scale is hard to read, so I decided to move to several absolute-scale plots instead:
Looks better! I was able to include more information: I made two plots for the first and second season (panel D) and a separate plot for the population size and vaccination (panel E).
After some consideration, I decided that labelling the time axis with some arbitrary dates was not the best idea. The original plan was to label some important events (the start of the first season, the peak of the first outbreak, etc.), so it would be easy to see how vaccination corresponds to these events. This was not working well. I decided to use simpler labels for the axis, but to superimpose the detected cases onto the vaccinated individuals.
Also, the axes and grid lines were made thicker and the background was made darker so that they are more visible in print.