The Johns Hopkins University Center for Systems Science and Engineering (JHU CSSE) has released the data it uses to power its 2019 Novel Coronavirus Visual Dashboard. Since I was curious about exploring that underlying data to better understand the spread of COVID-19, and about how Alteryx might be used to analyze it for predictive purposes, I built a workflow to do just that.
The data is stored in a GitHub repository at https://github.com/CSSEGISandData/COVID-19. If you're familiar with git, clone that repository somewhere on your system using your favorite git client. For those who are new to git, installing Git for Windows is an easy, free way to get started. Once it's installed, open a command prompt and navigate to a folder where you'd like to store the data (I keep all my data sets in C:\Data\). From there, run the following command to download the data from GitHub to your local system. It will create a new folder called COVID-19.
git clone https://github.com/CSSEGISandData/COVID-19.git
With the repository cloned, save the attached YXZP package to your system and double-click it to install it with Alteryx Designer. Select "Yes" to continue installing the package. When the dialog below appears, change the destination directory to the COVID-19 folder created in the previous step. For example, since I keep my data in C:\Data\, I'd set the destination directory field to C:\Data\COVID-19. Click Import to finish the process.
The data is read from the set of CSV files in the csse_covid_19_data\csse_covid_19_daily_reports\ subdirectory. A batch macro performs the import because, after a certain date, Latitude and Longitude fields were appended to the data set. A Formula tool then parses the Last Updated field, since the format of the date and time the record was last updated also changed partway through the data set. A Sort tool followed by two Multi-Row Formula tools populates null Latitude and Longitude fields with the corresponding values from earlier matching records. Finally, a Select, Auto Field, Data Cleansing, and second Sort tool perform some cleansing of the data: all null strings are replaced with empty strings, and all null integer fields are set to 0. The few remaining null Latitude and Longitude values are left as-is. The workflow generates a YXDB output.
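If you'd like to explore or reproduce these steps outside Designer, here is a minimal pandas sketch of the same cleanup logic. It is not part of the workflow; the data path and the column names ("Last Update", "Province/State", and so on) are assumptions based on the layout of the early daily-report files, which changed over time.

import glob
import pandas as pd

# Read every daily report; older files lack Latitude/Longitude columns,
# so add them before stacking (the batch macro handles this in Alteryx).
files = glob.glob(r"C:\Data\COVID-19\csse_covid_19_data\csse_covid_19_daily_reports\*.csv")
frames = []
for path in files:
    df = pd.read_csv(path)
    for col in ("Latitude", "Longitude"):
        if col not in df.columns:
            df[col] = float("nan")
    frames.append(df)
data = pd.concat(frames, ignore_index=True)

# The last-updated timestamp changes format partway through the data set,
# so parse each value individually and let pandas infer the format.
data["Updated"] = data["Last Update"].apply(pd.to_datetime)

# Fill null coordinates from earlier matching records
# (the Sort + Multi-Row Formula steps in the workflow).
data = data.sort_values(["Country/Region", "Province/State", "Updated"])
data[["Latitude", "Longitude"]] = (
    data.groupby(["Country/Region", "Province/State"], dropna=False)[["Latitude", "Longitude"]]
        .ffill()
)

# Cleansing: null strings become empty strings, null counts become 0.
data["Province/State"] = data["Province/State"].fillna("")
for col in ("Confirmed", "Deaths", "Recovered"):
    data[col] = data[col].fillna(0).astype(int)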
The final data output contains the following fields:
| Name | Type | Description |
| --- | --- | --- |
| Country/Region | V_String | Name of the country or region in the world where the data was reported |
| Province/State | V_String | Name of the province or state in the country where the data was reported. Can be empty if the data is reported only at the country or region level |
| Updated | DateTime | The date and time the record was last updated |
| Confirmed | Int32 | Number of confirmed COVID-19 cases reported |
| Deaths | Int16 | Number of COVID-19 cases resulting in death |
| Recovered | Int32 | Number of COVID-19 cases resulting in recovery |
| Latitude | Double | Latitude of the centroid of the reporting area |
| Longitude | Double | Longitude of the centroid of the reporting area |
To use the data with spatial tools, you'll need to choose an appropriate replacement for the null values in the Latitude and Longitude fields. Fortunately, fewer than 1% of the values in those fields are null.
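Outside Designer, the same idea might look like the sketch below (continuing from the earlier pandas example; the country-level fallback is just one possible replacement strategy, not something the workflow does):

# Option 1: simply drop the few records with no coordinates before mapping.
spatial_ready = data.dropna(subset=["Latitude", "Longitude"])

# Option 2: fall back to the average coordinates reported for the same country/region
# (a rough stand-in; a proper centroid lookup would be more accurate).
fallback = data.groupby("Country/Region")[["Latitude", "Longitude"]].transform("mean")
data[["Latitude", "Longitude"]] = data[["Latitude", "Longitude"]].fillna(fallback)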
The GitHub repository is updated daily by JHU CSSE. To keep your local copy up to date, open a command prompt, navigate to the directory where the data is stored (e.g., C:\Data\COVID-19\), and run the following command:
git pull
If you have questions about using the workflow, or want to share what you create or discover with the data, please share in our General Discussion forum thread.
JHU_COVID-19_Daily_Import.yxzp
JHU_COVID-19_Daily_Import2.yxzp