Johns Hopkins University CSSE has released the data set that powers their dashboard on GitHub (https://github.com/CSSEGISandData/COVID-19). To make that data easy to work with, I created the attached workflow and macro to import the daily data found in that repository. Install the package to the root of the folder that Git creates when you clone the repository. The workflow imports the daily data, parses the date fields that change format halfway through the time series, fills in null latitude and longitude fields, and performs other general cleansing. With that done, you can experiment with daily worldwide COVID-19 confirmed/deaths/recovered case numbers at the country/region and province/state level, with geocoding available for about 99% of records.
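Outside of Alteryx, the trickiest cleansing step is the date column whose format changes partway through the series. A minimal pandas sketch of one way to handle that (the two format strings are assumptions based on the early US-style and later ISO-style timestamps in the JHU files):

```python
import pandas as pd

def parse_mixed_dates(series: pd.Series) -> pd.Series:
    """Parse a column whose format changes mid-series:
    early files use 'M/D/YYYY H:MM', later ones ISO 8601."""
    # First pass: US-style dates; rows in the other format become NaT
    us = pd.to_datetime(series, format="%m/%d/%Y %H:%M", errors="coerce")
    # Second pass: ISO 8601 for whatever the first pass missed
    iso = pd.to_datetime(series, format="%Y-%m-%dT%H:%M:%S", errors="coerce")
    return us.fillna(iso)

dates = pd.Series(["1/22/2020 17:00", "2020-03-22T23:45:00"])
parsed = parse_mixed_dates(dates)
```

The two-pass approach keeps each format explicit rather than relying on per-row format inference, which is slower and can silently swap day and month.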
If this workflow is useful, please let us know. If you need help or have improvements to the workflow, please share.
We'd also love for you to share what you create or discover by replying to this thread!
EDIT: The workflow has been updated to better clean and regularize the data. A lot of cleanup is now being done to country and state names: duplicates are merged, and a locality field is parsed out of values such as "Chicago, IL". Review the new workflow for details. This should improve the quality of the output data significantly, although JHU is still working on upstream issues on their end.
@shevshenko
Did you consider keeping just the last day's row for each country?
If you sum all rows per country you'll get a distorted number. For example: day 1, 3 confirmed; day 2, 5 confirmed. The correct total is 5, but the sum gives 8.
Check whether this is the issue.
@klaus01 That's correct: the data is a running total, so the most recent day is the current total. So, @shevshenko, don't sum the days; take only the day that is relevant for your purpose (i.e., the most recent day if you want a current dashboard).
When I upload the revised version of the workflow, I'll edit my original post to make this clearer.
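To make the running-total point concrete, here's a small pandas sketch with made-up country names showing the wrong aggregation next to the right one:

```python
import pandas as pd

# Hypothetical cumulative data: values are running totals, so summing
# across days double-counts. Take each country's most recent row instead.
df = pd.DataFrame({
    "Country": ["Foo", "Foo", "Bar", "Bar"],
    "Date": pd.to_datetime(["2020-03-01", "2020-03-02",
                            "2020-03-01", "2020-03-02"]),
    "Confirmed": [3, 5, 10, 12],
})

# Wrong: summing running totals (Foo would show 8 instead of 5)
# df.groupby("Country")["Confirmed"].sum()

# Right: keep only each country's latest row
latest = df.sort_values("Date").groupby("Country").tail(1)
```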
@nickbecks and others:
Here's a new version of the workflow that addresses the radical schema change that JHU introduced starting with the 22 March data file. They've cleaned up column names, which is nice, added a FIPS field with the numeric code for US counties, added an Admin2 field with the name of an administrative region below the province/state level (county level for the US), and added an Active field representing the current active cases as of the Last_Updated value. Active is calculated as Confirmed - Deaths - Recovered.
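If you ever need to recompute or sanity-check the Active field outside the workflow, the definition above is a one-liner in pandas (the figures below are made up for illustration):

```python
import pandas as pd

# Illustrative rows only; Active = Confirmed - Deaths - Recovered
df = pd.DataFrame({"Confirmed": [100, 250],
                   "Deaths":    [5, 10],
                   "Recovered": [20, 40]})
df["Active"] = df["Confirmed"] - df["Deaths"] - df["Recovered"]
```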
Because of the number of schema changes, the original batch macro has been removed. Otherwise, use it as you would the original workflow.
@DavidW thanks, this is great.
Is there a way to pass the file name as a field output from the Dynamic Input tool? I tried and failed...
I'm looking to use the file name as the date rather than the last-updated date. Using the last-updated date results in gaps when a region has no data point for a given date. For example, many regions (US, Spain, Italy, Germany, France, etc.) have no last-updated record for 3/13/20, so summing total cases on 3/13/20 produces a misleading result.
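For reference, the JHU daily reports are named by date (MM-DD-YYYY.csv), so once the file name is available as a field, turning it into a date is straightforward. A sketch in Python (the folder name in the example path is an assumption):

```python
import pandas as pd
from pathlib import Path

def date_from_filename(path: str) -> pd.Timestamp:
    """The daily report file name is the reporting date, so unlike
    the last-updated field it has no gaps for quiet regions."""
    stem = Path(path).stem  # e.g. "03-13-2020"
    return pd.to_datetime(stem, format="%m-%d-%Y")

d = date_from_filename("csse_covid_19_daily_reports/03-13-2020.csv")
```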
Well, there are some gaps in the CSV files used as input when you run the Alteryx workflow. The files are all in the folder after you run the git pull command, but you will hit issues if a file's column names differ from the ones used in the template. Check the column names first; otherwise the file will be skipped and will not be counted in your totals.
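If you want to catch those skipped files up front rather than discover them in the totals, a quick header check works. A sketch assuming the original JHU column names (Province/State, Country/Region, etc. -- adjust the set to your template):

```python
import pandas as pd

# Assumed template headers for the pre-22-March JHU schema
EXPECTED = {"Province/State", "Country/Region", "Last Update",
            "Confirmed", "Deaths", "Recovered"}

def headers_match(df: pd.DataFrame) -> bool:
    """True if the frame carries every column the template expects."""
    return EXPECTED.issubset(df.columns)

ok = headers_match(pd.DataFrame(columns=sorted(EXPECTED)))
bad = headers_match(pd.DataFrame(columns=["Confirmed", "Deaths"]))
```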
@shevshenko Are you using the new workflow?
@nickbecks Attached is a new version of the workflow that adds the FileName field from the Dynamic Input tools to the output. That should work for what you are trying to do.
Yes, I did use the updated version but still got the header warnings. I will double-check, but I did make the manual change in the second Select tool:
More sources are being made available, so I figured I'd contribute other options.
New York Times: In response to a number of requests, the New York Times is releasing the data they have used for their own reporting. Details can be found on their GitHub here. The data starts from the first reported case in Washington state on 2020-01-21.
This workflow uses a single Python tool to pull the data in CSV form (simpler to transform into a table using pandas), plus a few small modifications to fill in latitude and longitude where none was provided.
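Inside the Python tool, the pandas side of this is a single read_csv call. Here the live download is replaced with a small inline sample (the column layout mirrors the NYT state-level file: date, state, fips, cases, deaths) so the snippet runs offline; the sample rows are illustrative:

```python
import io
import pandas as pd

# Inline stand-in for the downloaded NYT CSV so this runs offline;
# swap in the raw GitHub URL of the file to pull live data.
sample = io.StringIO(
    "date,state,fips,cases,deaths\n"
    "2020-01-21,Washington,53,1,0\n"
    "2020-01-22,Washington,53,1,0\n"
)
# fips is read as a string to preserve any leading zeros
df = pd.read_csv(sample, parse_dates=["date"], dtype={"fips": str})
```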