
SusanCS
Alteryx Alumni (Retired)

We’ve all got that one Facebook friend who posts charts that make you cringe: the ever-popular 3-D pie chart, the questionable bar chart with no source, the scatter plot “proving” that X causes Y.  

 

Unfortunately, the coronavirus pandemic provides new material for the same data visualization problems that have always existed. In a time of uncertainty and fear, poorly designed visualizations can spread misinformation and provoke even more emotion. 

 

I recently read Alberto Cairo’s 2019 book, How Charts Lie: Getting Smarter about Visual Information, and have pulled out some major points from the book that can inform how we create and analyze data visualizations about any topic. Cairo, a journalist, designer, and University of Miami professor, is also the author of the well-known data visualization books The Functional Art and The Truthful Art. 

 

How Charts Lie is an excellent primer for anyone wanting to develop their “graphicacy,” or graphical literacy, and is a terrific reminder of key principles of good visualization design for anyone who ever generates a chart. Though the book’s title sounds negative, Cairo focuses primarily on the ways designers may inadvertently misrepresent their data and mislead viewers. To be sure, there are those who make charts deliberately to mislead and confuse, but this book is mainly for readers hoping to make a good, honest chart, or who just want to understand others’ charts better. 

 

I’ve chosen a few recent coronavirus-related visualizations to demonstrate some of Cairo’s major points. Instead of being like your cringe-y Facebook friend, though, I’ll show you some interesting visualizations that show the thoughtful, rigorous communication of data that Cairo encourages. As Cairo notes, “Public debates in modern societies are driven by statistics, and by charts, which are the visual depiction of those statistics.” Just having a chart lends you authority and credibility. That’s a power to wield carefully! Let’s see how it’s being done well at this critical time.

 

Simpler Is Not Always Better

 

Should a good chart be understandable at a glance? Not necessarily, Cairo says. 

 

“Contrary to what many people believe, most good charts aren’t simple, pretty illustrations that can be understood easily and intuitively,” Cairo writes. Complexity is sometimes necessary to effectively communicate complicated things: “Many [charts], particularly those that contain rich and deep messages, may require time and effort, which will pay off if the chart is well designed. Many charts can’t be simple because the stories they tell aren’t simple.”

 

For example, the chart below from the Financial Times is pretty complicated at first glance. (All charts included here are from March 25.)

 

[Figure: Financial Times chart of coronavirus deaths by country, with doubling-rate benchmark lines]

 

There’s a lot going on here -- numbers on two vertical axes! Colors! Dashed and dotted lines! And some stars sprinkled about! Whew. And yet, investing time in understanding this chart pays off. Not only can you compare the number of coronavirus deaths in each country, but it’s also easy to see the countries’ trajectories, and how the slope of each country’s line compares to benchmarks (the dashed lines marking “deaths double every day,” two days, etc.). Continental trends are also visible through the lines’ color-coding. The stars mark significant events. Taking time to fully grasp each element of this chart provides the viewer with a ton of information. 
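
If you want to get a feel for what those dashed benchmark lines represent, here is a minimal Python sketch. It assumes steady exponential growth, and the function names and numbers are made up for illustration; none of it comes from the FT chart itself. On a logarithmic axis (which I'm assuming this kind of chart uses), each "doubles every N days" benchmark becomes a straight line.

```python
import math


def benchmark_value(start_count, days_elapsed, doubling_days):
    """Height of a 'doubles every N days' reference line after some days."""
    return start_count * 2 ** (days_elapsed / doubling_days)


def doubling_time(count_start, count_end, days_elapsed):
    """Days needed to double, assuming steady exponential growth
    between the two counts."""
    return days_elapsed * math.log(2) / math.log(count_end / count_start)


# Made-up starting point for illustration only (not from the FT chart).
start = 10
for doubling_days in (1, 2, 3, 7):
    after_week = benchmark_value(start, 7, doubling_days)
    print(f"doubles every {doubling_days} day(s): {after_week:.0f} after 7 days")

# A hypothetical country going from 10 to 40 deaths in 6 days.
print(f"10 -> 40 in 6 days: doubles every {doubling_time(10, 40, 6):.1f} days")
```

Comparing a country's own doubling time to the benchmarks is the numeric version of comparing the slope of its line to the dashed lines.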

 

That chart tells the viewer a lot about the pandemic story. But does it tell us why these countries have such different trajectories -- for example, why South Korea and Japan look so different from other countries? Or how those starred events affected the disease’s spread -- for example, whether Spain’s lockdown decision made any impact? 

 

As well constructed as it is, this chart can’t answer these questions. Cairo says that charts “just help us discover intriguing features that may later lead us to look for those answers by other means. Good charts empower us to pose good questions.” Those are the bigger, deeper questions that will help policymakers and public health experts figure out the right path forward.

 

Be Honest About Uncertainty

 

We’d all love to be able to generate completely accurate measurements and predictions, but that’s not realistic (and recent events sure show that, well, you just never know what’s coming next). But the presence of statistics and a chart can imply certainty, even if you don’t mean to imply that your data provide a definitive answer. Cairo writes, “Uncertainty confuses many people because they have the unreasonable expectation that science and statistics will unearth precise truths, when all they can yield is imperfect estimates that can always be subject to changes and updates.” 

 

Cairo says that the “crisp and sharp” edges of a nice, clean traditional chart can be misleading, and we should be “mentally blurring” those edges to allow uncertainty into our understanding of the data. Chart designers can also incorporate literal blurriness. As an example, Cairo offers the chart below, which he created to represent election polling data. The blurry gradients around the point estimate display the margin of error and the resulting uncertainty about the election outcome.

 

[Figure: Cairo’s election polling chart, with blurred gradients around the point estimate]

 From Cairo’s free online repository of graphics from the book.
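
For a rough sense of where a polling margin of error comes from, here is a small Python sketch using the common textbook approximation for a proportion. It assumes a simple random sample and a 95% confidence level (z = 1.96). The poll numbers and function names are hypothetical, and this is not necessarily how Cairo computed the values in his chart.

```python
import math


def margin_of_error(share, sample_size, z=1.96):
    """Approximate margin of error for a polled share.

    Assumes a simple random sample; z=1.96 corresponds to roughly 95%.
    """
    return z * math.sqrt(share * (1 - share) / sample_size)


def interval(share, sample_size, z=1.96):
    """Low and high ends of the range around the point estimate."""
    moe = margin_of_error(share, sample_size, z)
    return share - moe, share + moe


# Hypothetical poll of 1,000 people: candidate A at 48%, candidate B at 45%.
candidate_a = interval(0.48, 1000)
candidate_b = interval(0.45, 1000)

# If the two ranges overlap, the poll alone can't tell us who is ahead.
overlap = candidate_a[0] <= candidate_b[1] and candidate_b[0] <= candidate_a[1]
print(f"A: {candidate_a[0]:.3f} to {candidate_a[1]:.3f}")
print(f"B: {candidate_b[0]:.3f} to {candidate_b[1]:.3f}")
print("Ranges overlap:", overlap)
```

The overlapping ranges are the numeric counterpart of the blurry gradients: the point estimates look distinct, but the plausible values around them do not.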

 

A current example of this kind of “blurriness” that I found effective is the bottom portion of this pyramid chart from Our World in Data, which has assembled a fascinating set of constantly updated visualizations related to the pandemic. The designers acknowledge here that though we have a reasonably good handle on the data in the upper portions of the pyramid, there’s still uncertainty around the true number of actual coronavirus cases. The blurry bottom portion of the pyramid is also the widest, showing there could have been a great many unrecognized cases, beyond even the thousands of known mild cases.

 

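[Figure: Our World in Data pyramid chart of coronavirus cases, with a blurred bottom portion for unrecognized cases]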

 

 

Acknowledging the presence of uncertainty, even in more routine business and life situations, can help your chart’s viewers take away a more realistic understanding of your data and the insights it can offer. Some familiar examples of showing uncertainty include displaying ranges or confidence intervals on a chart; showing time series forecasts with a fan chart (like the one below, generated with the TS Forecast tool, that displays 80% and 95% confidence intervals in dark and light gray respectively); or sharing a distribution instead of a single measure of central tendency (e.g., a mean or median) when a solitary value may not effectively capture the possibilities for a variable. 

 

[Figure: time series forecast fan chart from the TS Forecast tool, with 80% and 95% confidence intervals]
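
Outside of a forecasting tool, two of the options above (a range around an estimate, and a distribution instead of a single number) can be sketched in a few lines of Python. This is only an illustration: the data are made up, the function names are mine, and the interval uses a simple normal approximation rather than whatever method the TS Forecast tool applies.

```python
import math
import statistics


def mean_with_interval(values, z=1.96):
    """Mean plus an approximate 95% interval (normal approximation)."""
    mean = statistics.mean(values)
    standard_error = statistics.stdev(values) / math.sqrt(len(values))
    return mean - z * standard_error, mean, mean + z * standard_error


def five_number_summary(values):
    """Minimum, quartiles, and maximum: a compact view of the distribution."""
    q1, median, q3 = statistics.quantiles(values, n=4)
    return min(values), q1, median, q3, max(values)


# Made-up daily order counts, for illustration only.
daily_orders = [10, 12, 14, 16, 18]

low, mean, high = mean_with_interval(daily_orders)
print(f"mean {mean:.1f}, plausible range {low:.1f} to {high:.1f}")
print("min, Q1, median, Q3, max:", five_number_summary(daily_orders))
```

Reporting the range or the summary alongside the mean gives viewers a sense of how much the single number could be hiding.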

 

Use Maps Wisely

 

Cairo also discusses the use of maps, which he says are some of the “most misused” data visualizations. For example, a chart creator might color a U.S. map with different shades in each state to show how many customers a nationwide company has in each state, but in doing so ends up really just reflecting a larger or smaller state population. After all, the more people, the more likely there’s a large number of customers there. Instead, Cairo suggests, it usually makes more sense to use the adjusted data for each area, like a per capita measurement or a percentage of the population, in lieu of the raw data.
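
Here is a minimal Python sketch of that adjustment, using the customer example. The state names, customer counts, and populations are entirely made up, and `per_capita` is just an illustrative helper, but it shows how the ranking can flip once population is taken into account.

```python
def per_capita(count, population, per=100_000):
    """Scale a raw count to a rate per `per` residents."""
    return count * per / population


# Entirely hypothetical states, customer counts, and populations.
states = {
    "State A": {"customers": 9_000, "population": 3_000_000},
    "State B": {"customers": 2_000, "population": 400_000},
}

rates = {
    name: per_capita(data["customers"], data["population"])
    for name, data in states.items()
}

top_by_raw_count = max(states, key=lambda name: states[name]["customers"])
top_by_rate = max(rates, key=rates.get)

for name, rate in rates.items():
    print(f"{name}: {states[name]['customers']} customers, {rate:.0f} per 100,000")
print("Most customers:", top_by_raw_count)
print("Highest rate:", top_by_rate)
```

State A would get the darkest shade on a raw-count map, while State B would stand out on a per capita map, even though nothing about the underlying data changed.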

 

There are probably thousands of maps on the internet displaying coronavirus-related data, including static, animated and interactive varieties. Their creators made difficult choices about how best to display their data, though most have chosen to display raw numbers of cases in each locality instead of normalized numbers. 

 

Here are two maps from Our World in Data side by side; the map with the red color scheme displays the raw count of cases, while the map with the blue-green color scheme displays a normalized count per million population.

 

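[Figure: Our World in Data map of raw case counts (red color scheme)]

[Figure: Our World in Data map of cases per million population (blue-green color scheme)]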

 

Is the map of the raw number of cases more or less informative than the map of the normalized data? To be sure, the map of the raw counts is frightening and emphasizes the intensity of the pandemic, with so many countries shown in deeper shades of red. However, the map of the normalized data raises some different questions: Why does Russia have so many fewer cases relative to its population than do other Eurasian countries? (Population density?) What is different about Central America and Africa that they have fewer cases so far? (Weather?) Are there other questions we can generate from studying the mapped normalized data that could help us cope with this pandemic or prevent future ones? It’s not that one approach is right or wrong; both maps are useful. They each tell part of the story, and they each provoke interesting questions.

 

I also recently looked at the map below from The Oregonian, which displays the number of hospital beds per thousand residents in each of Oregon’s counties:

 

[Figure: The Oregonian map of hospital beds per thousand residents in each Oregon county]

 

The map is interactive on their website. Hovering over any county displays additional data, including the total and available hospital beds, the population and how many people are over 65, the poverty rate, and the hospital beds per 1,000 people. 

 

The map intrigued me not only because it’s relevant to me as an Oregonian, but also because it demonstrates another challenge of map design. Typically, when we look at a map like this, we are seeing more of something displayed by a more saturated shade of the selected color. A quick glance at this map and the title seems to imply “more blue, more beds” -- but it’s the opposite. The legend reveals that a deeper blue means fewer beds per 1,000 county residents; in other words, the problem is more intense where the color is more intense. This choice reveals an interesting dilemma for the map creators: whether to color the map in a way that correlates with the numbers, or to emphasize the issue the numbers represent.

 

Alteryx co-founder Ned Harding also recently explored the challenge of mapping U.S. coronavirus cases effectively. He explains his thought process in a blog post with examples created in Alteryx, starting with an initial effort that displayed the cases per 100,000 of population in counties. Finding that approach less than ideal, he ultimately created a map with individual points for each case, mapped onto the states in which they occurred, plus animation to demonstrate the growth in cases -- an element of the story that is hard to communicate with a static map.

 

With a complex issue like the pandemic, there are many options for displaying data, and each choice tells a slightly different side of the story. Similarly, in our everyday data visualization, we can consider what parts of a data story our charts and maps emphasize or omit. What is the key takeaway we want our audience to have? A better understanding of an overall problem versus just the numbers? A sense of the change over time versus the situation at a specific moment? What angle matters most to move the discussion forward?

 

Keeping an Open Mind

 

Interacting with The Oregonian’s map revealed some insights I didn’t expect about the state I’ve chosen to call home. Whatever the topic of a data visualization, Cairo also emphasizes that we must approach it with an open mind as much as possible. “The more we cherish an idea, the more we’ll love any chart that corroborates it,” he writes. We all tend toward confirmation bias and rationalization, and we’ll interpret visualizations in ways that support our existing beliefs. 

 

As we move forward through -- and, eventually, beyond -- this pandemic, we’ll start to see new visualizations showing the end of the disease’s spread, and others for all kinds of issues that will be reimagined in the world that comes after. 

 

Charts and maps will be “conversation enablers” in that process, to use Cairo’s phrase. The more we can craft useful visualizations and encourage their careful use and interpretation, the better the conversations we can have, whether within a business or industry or in society more broadly.

Susan Currie Sivek
Senior Data Science Journalist

Susan Currie Sivek, Ph.D., is the data science journalist for the Alteryx Community. She explores data science concepts with a global audience through blog posts and the Data Science Mixer podcast. Her background in academia and social science informs her approach to investigating data and communicating complex ideas — with a dash of creativity from her training in journalism. Susan also loves getting outdoors with her dog and relaxing with some good science fiction. Twitter: @susansivek
