Happy 8th birthday to the Maveryx Community! Take a walk down memory lane in our birthday blog, and don't miss out on the awesome birthday present that all Maveryx Community members get to take advantage of!
Our main navigation will be under construction for the next hour. "Launch Pad" will be moving to "Learn", and the "Introduce Yourself" forum will be moving to a thread in the "General" discussion.

Data Science

Machine learning & data science for beginners and experts alike.
Alteryx Alumni (Retired)

You walk down one aisle of the grocery store to get your favorite cereal. On the dairy aisle, someone sick from COVID-19 coughs. 


Did your decision to grab your cereal before your milk possibly keep you healthy?


It’s the kind of question we’re all asking as we make simple daily decisions during this global pandemic. But imagine now that it’s your job to create a model for the spread of COVID-19 that accounts for humans’ unpredictable, near-random choices. Your model also could include government mandates for social distancing, hospital care availability, pre-existing conditions among a population, and more.


Sound complicated? It sure is, but researchers in a variety of fields are working to create the most accurate, useful models possible to predict and explain the spread of COVID-19. I’m not an epidemiologist, but if you too have been impressed and intrigued by the data visualizations and modeling publicized so far, this post is for you. 


I explored some recent research to learn a little bit how these researchers go about what some call outbreak analytics. Here’s an introductory look at just a few components of this unique analytics process. We’ll find some lessons for other analytics applications, too.


Estimating Virus Transmission


“Once we know the R0, we'll be able to get a handle on the scale of the epidemic.” 

- the movie Contagion




The reproduction number, or R0 (pronounced “R naught”), refers to the number of people someone who is ill is likely to infect. The R0 changes in every time and place that a disease occurs due to factors like demographics, climate, social structures, and social distancing measures. Researchers calculate the R0 by determining the time between a first infection and a second infection (the generation time, which can then be charted as a generation time distribution when this time is known for many pairs of sick people). Estimating the R0 is critical to predicting a disease’s spread. For SARS-CoV-2, which causes COVID-19, the R0 is currently thought to vary between 1.5 to 3.5 in different places. 


An R package called R0 (available here) can calculate the current R0 using outbreak data. Offering five different methods for calculation, it also has an option for sensitivity analysis that shows which selection of time window or generation time provided the best fit to the data. 


There’s also a similar R package, EpiEstim, developed by a different group of researchers, that has an Excel option. EpiEstim is based on a branching process model and estimates the R0 based on simple time series data. This model tries to capture the number of people each infected person will infect, but with an element of randomness (or stochasticity) -- like a random encounter in the grocery store with an infected person. The chart below (part of a larger graphic) shows the estimates of R0 generated with this model for five past outbreaks.





Sequencing Pathogen Genomes


“It shows novel characteristics. … It’s Godzilla, King Kong and Frankenstein all in one.” -- Contagion




Genetic analysis of SARS-CoV-2 samples from patients who become ill in different places and at different times can help researchers track the spread and mutation of the virus. This analysis can also help the quick identification of possible treatments. Researchers recently demonstrated a new machine learning approach to identifying the type of an unknown virus and its various strains based on its genome (i.e., figuring out its taxonomic classification, like the broad category of “coronavirus”). 


This approach converted the SARS-CoV-2 genomic sequence into a numerical representation (see the full research paper for details). The sequences of nearly 15,000 other viruses were also used in the training data. The researchers trained six different machine learning models (Linear Discriminant, Linear SVM, Quadratic SVM, Fine KNN, Subspace Discriminant, and Subspace KNN). The trained models then output their predictions for labels at the highest taxonomic level for the genomes of the virus strains causing COVID-19. The researchers then shifted the models to focus on the next, more specific taxonomic level, and repeated the process again.


The graphic below shows the researchers’ results from their last two tests, which classified 153 viral sequences into four sub-genera and COVID-19, and then classified 76 viral sequences either as other Sarbecovirus types or as COVID-19.




This strategy helped identify not only that SARS-CoV-2 should be correctly classified with other Coronaviridae and Betacoronavirus pathogens, but also that it has important similarities to other viruses found among bats. The researchers argue that their approach is faster (under 10 minutes, including 10-fold cross-validation) and can compare more, and more diverse, samples than previous analytic processes. While there may be insights here for the current pandemic, this kind of approach could be helpful in future outbreaks.


Predicting Disease Spread Despite Uncertainty


“And from there, using our model, based on the R0 of 3.2 … here is where we expect to be in 48 hours.” -- Contagion




One epidemiological modeling approach is to create an “SIR model,” which divides the population as a whole into “compartments”: those who are Susceptible to the disease, Infected, and Removed (either recovered from the illness and presumably granted some degree of immunity, or no longer in the population because they died). 


However, generating such a model is never easy. Outbreak data can be impossible to gather accurately for many reasons, including institutional barriers, lack of testing, unknown or asymptomatic cases, and so forth. And, as we’ve seen, governments have implemented varied social distancing and isolation measures at different times, which can have unpredictable results on the number of people in that “Susceptible” compartment.


To deal with all the sources of uncertainty, one group of researchers developed what they call an “eSIR” model -- an extended model including “a time-varying probability that a susceptible person meets an infected person or vice versa,” as well as a new compartment to include people who are susceptible but choose self-quarantine. Both of these factors would vary in specific regions based on when isolation protocols were implemented. 


To further incorporate uncertainty into the model, the researchers used a Markov chain Monte Carlo (MCMC) algorithm. (Here are two explanations for MCMC: one simpler, one more complex.) The MCMC approach allows for the approximation of a distribution that can’t be directly known -- like the true number of SARS-CoV-2 infections -- or that is too expensive to figure out. The eSIR model’s forecasts aim to reveal “turning points” in the outbreak. Turning points include when the daily number of new cases stops growing and when the number of infected cases reaches its highest point. The model can also provide an estimate of the R0.


The researchers are distributing an R package called eSIR that generates the models, ggplot2 objects, and summary statistics. What’s useful about this approach is that it can help determine which quarantine strategies might be most useful and when. As the researchers say, “... too strict quarantine can cause backfire; people may lose their trust and patience in their committed system, and consequently may try to reduce compliance or even avoid quarantine.” The risk of implementing strict quarantine systems for prolonged periods has to be weighed against the gains in disease prevention. This model provides one approach to that important calculation.




Takeaways for All Modeling


What lessons can we learn from these still-preliminary models that extend beyond the pandemic? There are a few key items. 


First, these models each show rapid innovation to address an extremely complex situation. (The genomic and 'eSIR' studies described above are still “preprints,” meaning they have not yet been peer reviewed and have been released as quickly as possible to contribute to the scientific community’s effort to address the pandemic.) It’s truly impressive to see the creativity that researchers are applying to this crisis with such speed despite great stress, and it’s inspiring for everyone as we seek solutions to many new challenges. 


Second, another challenge that may be familiar to data people: getting decisionmakers to take action based on insights from data. The great “flattening the curve” visualization and related forecasting seem to have had a strong impact on policymakers and the general public. Likewise, data experts need to be able to communicate about their analyses and models clearly -- for example, with effective data visualizations -- and organizations should build data literacy across all areas. 


Finally, the research I reviewed frequently mentioned the challenge of getting quality COVID-19 data upon which to build good models. Even in normal times, it can be hard to get the kind and quality of data we need. Models for aspects of the pandemic -- or any other phenomenon -- are only as good as the data upon which they are built. Data abounds everywhere, but not all is trustworthy, usable, or relevant. Every organization needs solid data gathering, management, and analytics structures in place (as these epidemiologists recommend for outbreaks). With that preparation, if a crisis of any kind emerges, your responses can be informed by accurate, relevant, up-to-date data readily at hand. 

Susan Currie Sivek
Senior Data Science Journalist

Susan Currie Sivek, Ph.D., is the data science journalist for the Alteryx Community. She explores data science concepts with a global audience through blog posts and the Data Science Mixer podcast. Her background in academia and social science informs her approach to investigating data and communicating complex ideas — with a dash of creativity from her training in journalism. Susan also loves getting outdoors with her dog and relaxing with some good science fiction. Twitter: @susansivek

Susan Currie Sivek, Ph.D., is the data science journalist for the Alteryx Community. She explores data science concepts with a global audience through blog posts and the Data Science Mixer podcast. Her background in academia and social science informs her approach to investigating data and communicating complex ideas — with a dash of creativity from her training in journalism. Susan also loves getting outdoors with her dog and relaxing with some good science fiction. Twitter: @susansivek