Community Spring Cleaning week is here! Join your fellow Maveryx in digging through your old posts and marking comments on them as solved. Learn more here!

General Discussions

Discuss any topics that are not product-specific here.

Johns Hopkins University COVID-19 daily data workflow

DavidW
Alteryx Alumni (Retired)

Johns Hopkins University CSSE released the data set that powers their dashboard on GitHub (https://github.com/CSSEGISandData/COVID-19). If you want to work with that data easily, I created the attached workflow and macro to import the daily data found within that repository. Install the package to the root of the folder that Git creates. This will import the daily data, parse the date fields that change formats halfway through the timeseries, update null latitude and longitude fields, and other general cleansing. With that done, you can experiment with daily worldwide COVID-19 confirmed/deaths/recovered case numbers at the country/region and province/state level, with geocoding available for about 99% of records.

 

If this workflow is useful, please let us know. If you need help or have improvements to the workflow, please share.

 

We'd also love for you to share what you create or discover by replying to this thread!

 

EDIT: The workflow has been updated to better clean and regularize the data. A lot of clean up is being now to country and state names, with merging of duplicates being done, and a locality field being parsed out of values such as "Chicago, IL". Review the new workflow for details. This should improve the quality of the output data significantly, although JHU is still working on upstream issues on their end.

David Wilcox
Senior Software Engineer
Alteryx
52 REPLIES 52
klaus01
5 - Atom

@shevshenko

 

Did you consider just the last day row for each country?

Because, if you sum all row per country you'll get distorted number. Example, day 1, 3 confirmed, day 2, 5 confirmed, the correct amount of confirmed is 5, but you sum will get 8 confirmed. 

Check if is not this. 

 

DavidW
Alteryx Alumni (Retired)

@klaus01 That's correct, the data is a running total, so the most recent day is the current total. So, @shevshenko, don't sum them, but take only the day that is relevant for your purpose (i.e., most recent day if you want a current dashboard).

 

When I upload the revised version of the workflow, I'll edit my original post to make this more clear.

David Wilcox
Senior Software Engineer
Alteryx
klpenney_Raytheon
6 - Meteoroid
You are correct there are global lags. It is really immaterial as long as you are not over counting via too many duplications. Be rational.

Keith L. Penney
Data Analyst, Sr.
TEKWISSEN Contractor
Supply Chain Transformation
Space and Airborne Systems (SAS)
keith.penney-nr@raytheon.com
Desk: 972-344-9349
Mobile: 901-603-2906
DavidW
Alteryx Alumni (Retired)

@nickbecks and others:

 

Here's a new version of the workflow that addresses the radical change in schema that JHU introduced starting with the 22 March data file. They've cleaned up column names, which is nice, and also added a FIPS field with the numeric code for US counties, and Admin2 field with the name of an administrative region below province or state level (this is county level for the US), and added an Active field which represents the current active cases as of the Last_Updated value. This is calculated as Confirmed - Deaths - Recovered.

 

Because of the number of changes in the schema, the original batch macro is removed. Otherwise, use as you would the original workflow.

David Wilcox
Senior Software Engineer
Alteryx
nickbecks
6 - Meteoroid

@DavidW thanks, this is great. 

 

Is there a way to pass file name as a field output from the Dynamic Input tool? I tried and failed...

 

I'm looking to use file name as date rather than last updated date. Using last updated date results in gaps when a specific region does not have a data point for a given date. For example many regions (US, Spain, Italy, Germany, France, etc) don't have a last updated date record for 3/13/20, so if you were to sum total cases on 3/13/20 you would end up with a misleading result.

shevshenko
7 - Meteor

 

Well, there are some gaps in the CSV files that are used as an input if you run the Alteryx Workflow, I can tell you that the files are in the folder when you use the GIT Pull command, but you will face some issues if the Column names are different that the one used as a Template. So check the column names first otherwise the file will be skipped and it will not be considered as part of your totalss

 

Column_Name.JPG

DavidW
Alteryx Alumni (Retired)

@shevshenko  Are you using the new workflow?

David Wilcox
Senior Software Engineer
Alteryx
DavidW
Alteryx Alumni (Retired)

@nickbecks Attached is a new version of the workflow that adds the FileName field from the Dynamic Input tools to the output. That should work for what you are trying to do.

David Wilcox
Senior Software Engineer
Alteryx
shevshenko
7 - Meteor

Yes, I did use the updated version but still got the headers warnings, I will double check but I did the manual change in the second select tool:  

 

Column_Name_Change.JPG

neilgallen
12 - Quasar

More sources are being made available, so I figured I'd contribute other options.

 

New York Times: In response to a number of requests, the New York times is releasing the data they have used for their own reporting. Details can be found on their github here. The data starts from the first reported case in Washington state as of 2020-01-21.

 

This workflow utilizes one python tool to pull the data in csv form (simpler to transform into a table using pandas), and then requires a few small modifications to include latitude and longitude where none was provided. 

 

 

 

Labels