Engine Works

ChrisF · ‎09-08-2014

This is a discussion for how to implement python code into an Alteryx workflow. For an example of how to web scrape in a code free environment, please skip ahead to part 2.

Talking about Alteryx with anyone who has ever dealt with data for a living tends to begin the same way each time. Almost invariably, the conversation starts with a discussion of Alteryx’s ability to manipulate data into all sorts of usable forms, whether it’s a back-end ETL process, a multi-table join, connecting to a database, or just simply reading in multiple (and disparate) file types. While there’s no doubt that Alteryx enables its users to do all sorts of other interesting and useful things, its ability to do the heavy-lifting when it comes to the data is what always seems to grab people first, and for good reason. When it comes to data analysis, finding all the data you need and munging it into a form where you can actually DO something with it is almost always the most time-consuming part of the process.

That said, a lot of these conversations involve the same basic process: there's some sort of (relatively) structured data "out there" in a file or database that someone needs to import and analyze. In many cases, the data analyst typically has the luxury of working with “ready-made” data from a controlled data warehouse. But what if that's not the case? What if the data has yet to be fully collected, much less structured, and you need to get to it without driving yourself crazy? More often than not, this is the case when it comes to trying to pull data off the internet and turn it into something a little more analytics-friendly, a process known as “web scraping”. It turns out that between the Download tool and all its great data manipulation tools, Alteryx actually makes for a really good web-scraping platform in addition to its wealth of other uses.

The Initial Idea

My adventures with web-scraping in Alteryx actually began as a simple excuse to force myself to learn to work with the Run Command tool, and also learn some programming in the process. I studied economics in school, and so (in an effort to stay at least a little bit inside my comfort zone) I decided to try come up with some kind of web scraping project that would involve the stock market. My initial idea was to make a stock portfolio tracker that would let you tell it what stocks you owned (along with initial buy price, number of shares, etc.) and would then download and track historical trading information for each stock in your portfolio. More conceptually, the project had several main goals:

Scrape finance data off the web using some sort of iterative download/parse function
Store the scraped data in the cloud in a way that allowed me to access it later
Architect this entire process using Alteryx so that it could all be run as one module and configured dynamically

And so the research began. My first task was to figure out how to actually "do" web scraping since I was completely new to the idea. Luckily, lots of other people have had similar inclinations in the past, and they were kind enough to publish their experiences on the internet for me to eventually find and learn from. While it was obvious that there were all sorts of options, I settled on Python since it seemed fairly intuitive right from the start, and also all the cool (data analyst) kids just wouldn't shut up about it. After watching a whole bunch of Youtube videos and reading about 75% of this book over the course of a very nerdy weekend, I started fiddling around in a Python IDE and eventually came up with something that would take some user-input parameters and go download stock data from Google Finance for every stock that had been requested.

Storing the Data

Getting the data was a good start, but I still needed to find a place to put it since I was really looking to maintain and manipulate this data as well as update it on a regular basis. Conveniently, it was around this time that my manager showed me this fantastic website called Quandl. At first glance, Quandl is a massive repository for all sorts of time series data. You can find data on things from population by state to GDP per capita for every country in the world to daily CPI values for Canada, and you can easily download the data in all sorts of formats. More importantly, Quandl also had another lesser-known feature: the "Quandl Toolbelt", which essentially lets you create and host your own data sets on their servers with (relatively) minimal effort on your part. All you needed was a .csv file containing the data you wanted along with a few metadata lines at the top of the table to tell Quandl what to actually do with the .csv once you uploaded it. The best thing about it was that you could run the upload process repeatedly and Quandl was smart enough to figure out if you were just sending it the same data over and over (in which case it would justifiably ignore you) or you were actually sending it an updated version of the data (e.g. a new version of a table that had more recent data added to it), in which case it would update the existing table and add the new data.

So at this point, I updated the Python script to add the requisite metadata lines to each stock price table when it downloaded them so I could easily upload them to Quandl once it was done. Here's what it ended up looking like when all was said and done:

import urllib.request
import re
import sys
import codecs

### Create input parameters based on system arguments ###

user_args = sys.argv[1:]

#user_args = ['01-01-2013','05-20-2014','AAPL','GOOGL']

dates = user_args[0:2]

#dates = ['01-01-2013','05-20-2014']

start_date= "&startdate="+dates[0]

end_date= "&enddate="+dates[1]

tickers = user_args[2:]

#tickers = ["AAPL","GOOGL"]

### The Actual Download Function ###

i=0
while i<len(tickers😞
    url = "https://www.google.com/finance/historical?q=NASDAQ:"+tickers[i]+start_date+end_date+"&output=csv"
    stock_quote_download = urllib.request.urlopen(url).readlines()

    print ("code: "+tickers[i])
    print ("name: "+tickers[i]+" Daily Stock Info")
    print ("private: false")
    print ("---")

    for row in stock_quote_download:
        if row.startswith(codecs.BOM_UTF8😞
                row=row[3:].decode('utf-8')
        else:
                row=row.decode('utf-8')

        print(row.strip())

    i+=1

The download function itself is a fairly basic "while loop" that queries the Google Finance API, downloads a .csv file of historical price information for each stock and date range you specify, and then returns the comma-separated data with a few metadata lines added to the top. The real trick was getting this script to work inside of an Alteryx app where I could update the script parameters based on user input from the Alteryx UI side of things. This is where the Run Command tool became very handy.

Bringing It All Together

In order to configure the Python script from within Alteryx, my solution was to have Alteryx take all of the user inputs and include them in a command string that would then be written out to a batch (.bat) file. I used a Formula tool to manipulate all the necessary inputs into the correct format and ultimately construct the command string, and then, using the "Block Until Done" tool, I set the whole module up so that a Run Command tool would execute the batch file as soon as Alteryx was finished writing it out. Here's an example of what the contents of a typical batch file would look like:

python .\Supporting_Macros\google_scrape.py 01-01-2012 05-23-2014 AAPL GOOGL | quandl upload

And here's what the Alteryx workflow looks like:

The end result is that I was able to execute the Python script from the command line, and dynamically update the function parameters using all the system arguments that come after the initial script is called (i.e. the dates and the stock names). The Python script would begin by reading all of those arguments in and then creating the different parameters based on what it received. This way you could have a user enter a list of stocks and a date range in an Alteryx menu, and (unbeknownst to them) they'd actually be configuring a Python script indirectly. The final portion of the batch file, "| quandl upload" is actually taking the results of the Python script and uploading it all to Quandl so that I could reference it later. Since the Python script iterated once for each stock, the quandl upload function would create a separate table per stock, along with a very neat summary page, each of which looked something like this:

An example of the final result in Quandl can be found here.

So, this was awesome. I had this cool Python script, I was using cool-sounding phrases like "execute from the command line" on a regular basis, and I had a way to store my data in the cloud, but something was nagging at me. As fun as Python was, I strongly believe that Alteryx is essentially a programming language in and of itself, and I really felt like I should be able to do this stuff in Alteryx without needing to rely on another language.

It turns out my instincts were right, and in part 2 of this blog series I'll discuss how I went about rebuilding this same process using nothing but Alteryx tools, and how I was ultimately able to actually improve the functionality of the whole module in the process.

I've included a module that contains the initial Python version of the macro. Note that I have removed the "| quandl upload" portion of the command string simply because it requires a fair amount of configuration on the back-end in order to work and the macro would error without it. As it stands, you will still need Python 3 installed if you want to run it, but even if you don't, I encourage you to at least open it up and look at how everything fits together.

Thanks for reading.

-Chris

Engine Works

Web Scraping with Alteryx, Part 1

The Initial Idea

Storing the Data

Bringing It All Together