The following post was written by Joseph Lombardi, an intern working with the Alteryx Advanced Analytics team.
As with any real-world data source, nothing is perfect. Things can be dynamic, unsorted, or even encrypted, so it is useful to have the tools and skills to work around these issues. Alteryx comes in handy, as it gives its users the customization and data modification tools to make this problem-solving go a lot quicker.
There are three main topics covered:
The real world example used to address these can be found here: http://thedataweb.rm.census.gov/ftp/cps_ftp.html
The following three workflows show a working example of this, available as a module here (or at the bottom of this post) to import into Alteryx
Preparation
The first step is to ensure that Alteryx has the R Tool available by going to "Help" > "Install predictive tools". This can also be done by navigating to http://downloads.alteryx.com/Latest_RInstaller.htm . It may help to become familiarized with the 'Download' tool as well.
This whole project can be downloaded and ran with sample entries: CPS Download Decompress Open.yxzp
The example that will be used is the Current Population Survey (CPS) data found on http://thedataweb.rm.census.gov/ftp/cps_ftp.html
After downloading 1 month's data zip file, extracting, and opening the data with a text editor, the data appears to be fixed-width. The site also includes human-readable dictionaries for these files. The default method for Alteryx to open fixed-width data files is with a formatted .flat file dictionary, so a transformation is necessary.
In summary, what needs to be accomplished in the Alteryx workflow:
The R Tool will play a big roll in allowing users with some coding experience to solve many issues that accompany working with real-world and minimally structured data sources.
Download
The Download Tool downloads data as either an input field in Alteryx, or saves to a file. Every link in HTML is of the form
> <a href="http://example.com/link.html"> ...
This helps parsing the HTML of http://thedataweb.rm.census.gov/ftp/cps_ftp.html, which can be parsed using a Text To Columns Tool to generate rows with the delimiter: **"** (essentially leaving row entries with either link addresses or unused HTML.) The result would be
|... | |----| |<a href=| |http ://example.com/link1.zip| |> </div> <a href=| |http ://example.com/link2.zip| |...|
Simply using a Filter Tool to find links with desired extensions -- .zip for compressed data and .txt for dictionaries -- it becomes possible to feed these links to another Download Tool to get the desired data.
For further steps, it helps to use the structure of the directory naming conventions to match dictionaries with corresponding data files using the Find Replace Tool to append which dictionary goes with each .zip.
> For example,
> *.../basic/201301-/January_2013_Record_Layout.txt*
is the dictionary for all files under
>*.../basic/201301-/*
Extraction
This step requires the R Tool, or some use of an external program through the Run Command Tool.
The method using the R Tool is an easy application of the unzip() command in R:
```R filepath = read.Alteryx("#1", mode="data.frame") ## takes in the file path after it has been made into rows by a \ delimiter directory = paste(filepath[1:nrow(filepath)-1,1],collapse = "/") ## removes the file name, and reconcatenates it into a directory path directory = gsub("·"," ",pathy) ## makes sure spaces are handled correctly zipfile = read.Alteryx("#2", mode="data.frame")[[1,1]] ## filepath to the compressed data zipfile = gsub("·"," ",zipfile) ## makes sure spaces are handled correctly unzip(zipfile,exdir = directory) ## uncompresses the data. ## a directory is required since the working directory is a temporary folder, ## and not the folder the module is contained in. ```
Formatting a .Flat file
Putting together an R script to do this requires knowing what makes up a .flat file. Exporting a quick sample table from Alteryx (below) gives the following .flat file:
|a |b |Breed| |----|----|----| |1 |2 |cat | |3 |4 |dog | example.flat: ```HTML <flatfile version="1"> <file path="C:\Users\Administrator\Desktop\example.asc" ## This filepath changes eoltype="crlf" /> <fields> <field name="a " type="Byte" length="3"/> ## These <field name="b" type="Byte" length="3"/> ## Column Descriptors <field name="Breed" type="String" length="3"/> ## Change </fields> </flatfile> ```
So now all that is left to get the data into Alteryx is to write an R script for the R Tool that writes a file in the previous format, substituting information such as the filepath and dictionary entries. Three things must be handled: the field name, the field type, and the length of the field. A working script to do this is contained in the module file at the top of the page.
Once the data is readable by Alteryx, it is reopened as a .flat with the filepath handy,and saved as an Alteryx database (.yxdb) for further use.
*written by* Joseph Lombardi
Tara McCoy is the Creative Director at Alteryx. Since joining Alteryx in 2004, Tara has held roles in Product Management, Content Engineering, and Community where her focus has always been on delivering an amazing product experience with Alteryx. From designing product icons and community badges, procuring excellent swag, concocting engaging contests, and crowdsourcing content from internal and external Alteryx advocates alike, Tara is dedicated to the Alteryx brand and wants everyone to experience the thrill of problem solving with Alteryx.
Tara McCoy is the Creative Director at Alteryx. Since joining Alteryx in 2004, Tara has held roles in Product Management, Content Engineering, and Community where her focus has always been on delivering an amazing product experience with Alteryx. From designing product icons and community badges, procuring excellent swag, concocting engaging contests, and crowdsourcing content from internal and external Alteryx advocates alike, Tara is dedicated to the Alteryx brand and wants everyone to experience the thrill of problem solving with Alteryx.
You must be a registered user to add a comment. If you've already registered, sign in. Otherwise, register and sign in.