
Engine Works

Under the hood of Alteryx: tips, tricks and how-tos.
ChrisF
Alteryx Alumni (Retired)

 


In Part 1 of this series, I talked about the development of a web scraping module that would download historical stock data using a Python script and then host it in a cloud database using Quandl's "Quandl Toolbelt" application. Once I got this version of the module up and running, I realized that there were two big areas for improvement:

 

  1. The macro relied on a Python script to do a lot of the work, but I really wanted to be able to do everything using Alteryx tools.
  2. I had overlooked an important factor in the Google Finance API call that prevented the macro from being able to look up a wide array of stocks.

 

If I wanted to make this tool as versatile as possible, I was going to need to make it more portable (by minimizing the number of external applications needed), and more flexible (by enabling it to search for more stocks).

 

Getting Rid of Python

 

The first issue was a relatively straightforward problem. I was using a Python script to send API queries to Google Finance in an iterative process, and I needed to figure out how to replicate this iterative process in Alteryx. In general, an API query is sent using an API-specific base URL. For Google Finance, the API call begins with the following URL:

 

https://www.google.com/finance/historical?

 

This part essentially just tells Google that you want to request historical finance data from its API. The '?' at the end is what really turns this into an API call as opposed to a regular old web request, since it marks the beginning of what your specific request is going to be. Think of it this way: everything before the '?' is saying "I want to ask the API a question" and everything after the '?' is saying "Alright API, here's what I want to ask you". In this case, a typical API call would look like this:

 

https://www.google.com/finance/historical?q=NASDAQ:AAPL&startdate=01-01-2013&enddate=06-12-2014&output=csv

 

After the '?' you specify the stock you want ('q='), the date range ('startdate=' and 'enddate='), and how you would like the results returned ('output='). The Python script was basically just sending this same request once per stock requested, but each time it would swap out the stock name and date range for whatever you had specified. Once it ran out of stocks to look up, the loop would spit out all the results as comma-separated tables and then it was up to you to do something with them.
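For reference, here's a minimal Python sketch of the kind of loop the original script was running. This isn't the actual script; the ticker list, dates, and output file names are just placeholders, and only the URL structure comes from the API call shown above:

import urllib.request

BASE_URL = 'https://www.google.com/finance/historical?'
tickers = ['AAPL', 'GOOGL', 'MSFT']          # placeholder list of requested stocks
start_date, end_date = '01-01-2013', '06-12-2014'

for ticker in tickers:
    # Rebuild the API call for each stock, swapping in the ticker and date range
    url = (BASE_URL + 'q=NASDAQ:' + ticker +
           '&startdate=' + start_date +
           '&enddate=' + end_date +
           '&output=csv')
    # Download the comma-separated results and write them out for later use
    csv_data = urllib.request.urlopen(url).read().decode('utf-8')
    with open(ticker + '.csv', 'w') as f:
        f.write(csv_data)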

 

Once I thought about it, I realized that I didn't actually need this to be an iterative process at all if I did it in Alteryx using the Download tool, because the Download tool handles the iteration for you. The input for a Download tool is simply a field that contains a URL, but if that field has more than one row, each with its own URL, the Download tool will take the first URL, download all the data from it, and then move on to the next URL in the field, repeating the process until it runs out of records. To put it more simply: in the Python script I was re-writing the API call each time and then sending it out to the internet, whereas with the Download tool all I had to do was give it a unique URL for every stock I wanted and it handled the rest.

 

Once I realized how easy this was, the process actually became embarrassingly simple. The API call needs 3 different parameters in this case:

 

  • stock ticker name ('q=')
  • start date ('startdate=')
  • end date ('enddate=')

 

All three of these are specified by the user in the macro's UI. For example, let's assume the user wants to download historical data for 5 different stocks starting on January 1, 2014 and ending on June 30, 2014. Once they enter these parameters into the UI, the macro has a data stream with 5 records, one for each stock, as well as a start date and end date. At this point, creating the API call is a simple matter of running all 5 records through a Formula tool, formatting each parameter (for instance, the stock parameter becomes 'NASDAQ:'+[Ticker Name] to match the syntax of the original API call), and then combining them all into a URL field with the base API URL at the front. In the end, here's what the formula for the final API URL looks like:

 

'https://www.google.com/finance/historical?q='+[Stock_Names]+'&startdate='+[Start_Date]+'&enddate='+[End_Date]+'&output=csv'
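For the example above, the record for AAPL would come out of the Formula tool as a single URL field that looks something like this (assuming the dates are passed through in the same format as the earlier call):

https://www.google.com/finance/historical?q=NASDAQ:AAPL&startdate=01-01-2014&enddate=06-30-2014&output=csv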

 

What comes out of the Formula tool is simply a list of unique API calls that feed directly into the Download tool, which then processes each URL in sequence. After that, all you have to do is create those Quandl-friendly metadata lines (e.g. code = GOOGL, name = GOOGL Daily Stock Info, etc.) and add them to the top of each record of the Download tool's output. The end result is exactly what it was before: one comma-separated table of historical data for each stock you requested, plus the metadata needed to actually upload it to Quandl. From there, I simply adapted the Run Command process from the old macro, but instead of writing out a batch file that triggered both a Python script and the Quandl upload function, I was just writing out a comma-separated table for each stock and telling Quandl to upload it. Finally, I wrapped the process in a batch macro so it would execute once for each of the 5 tables, and I'd end up with 5 different Quandl data sets just like before, except this time it took about a tenth of the time and effort to develop and didn't require any other programming languages!
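To make that concrete, each per-stock table handed off to the upload step ends up looking roughly like the following. The metadata line syntax here is just the illustrative 'code = ...' / 'name = ...' form mentioned above, and the column layout is simply whatever Google Finance returns for '&output=csv' (assumed here to be the standard daily price columns):

code = GOOGL
name = GOOGL Daily Stock Info
Date,Open,High,Low,Close,Volume
...one row of price data per trading day in the requested range...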

 

Here's what the new workflow looked like:

Macro_Screenshot

But what about the second issue?

 

At the beginning of this blog, I mentioned that there were two issues. I'd solved the Python issue, but upon further testing I realized that I had a big problem: the macro just flat out did not work on a lot of the stocks I tried. I initially thought it must be that I'd gotten the ticker name wrong, or that maybe the stocks weren't "mainstream" enough to be on Google Finance...but then I remembered it's Google Finance and they've got ALL the stocks. I confirmed this when I tried to run the macro to look up historical data for Twitter (TWTR), a stock that I had no doubt would be tracked by Google, and got no results. Once I dug into it more, I realized that there was a fundamental flaw in my whole approach to building the API call.

 

Recall that in both previous versions of the macro, every stock name in the API call was preceded by 'NASDAQ:' so that you'd end up with something like 'q=NASDAQ:AAPL'. This works fine when you're looking for a stock that is actually traded on the NASDAQ exchange, but Twitter is traded on the NYSE, so 'NASDAQ:TWTR' doesn't actually mean anything to Google, hence the lack of results. If I wanted to be able to use this macro to look up all sorts of stocks, I'd have to know which exchange each stock was traded on.

 

At this point, I had a decision to make. The first option was to ask the user to figure out the exchange for each stock themselves and incorporate it into the UI configuration (e.g. if the user wanted to look up TWTR, they would now have to go find which exchange it was traded on and type 'NYSE:TWTR' into the UI). In my opinion, this is a terrible option. The whole point of this was to make getting to the data as easy as possible for the user, and if there's a tricky problem to solve in order to get the macro to run correctly, then I owe it to the user to figure that out myself and save them the trouble.

 

The second option was to keep asking the user for nothing more than the stock ticker and have the macro figure out which exchange it's traded on. This had the dual benefit of making the macro as accessible as possible while also giving me the opportunity to solve another interesting problem. The solution was to turn the macro into a two-stage web scrape. By way of example, let's take a look at the Google Finance result when we search for "TWTR":

 

Google Finance Screenshot

 

When you actually go to Google Finance and search for a stock, all you have to do is type in the ticker itself and it takes you to the stock's main Google Finance page. However, if you take a closer look, you'll notice a very useful piece of information sitting right there at the top: we searched for "TWTR", but Google is showing us "Twitter Inc (NYSE:TWTR)".

 

What this means is that it's now possible to take the original user input and find the exchange each stock is traded on by downloading the HTML source for each stock's page, parsing out the "NYSE:TWTR" piece of it, and then using that as the new [Stock_Names] field in the Formula tool. So now, instead of defining [Stock_Names] as 'NASDAQ:'+[Ticker_Name], we simply define it as the scraped exchange:ticker value, which already has the exchange information built in! From this point, we feed the newly scraped stock names into the original Formula tool, construct the API call, and send the information through the rest of the macro just like before.

 

Warning: RegEx Incoming

 

First off, if RegEx drives you up the wall or you're just not all that interested in the nuts and bolts of parsing data, feel free to skip ahead to the next section, as no one (not even me) would blame you. A co-worker (and good friend) of mine has often lovingly referred to Regular Expressions as 'The Dark Art' because they're powerful enough to either save the day or ruin everything, and sometimes the difference that determines which result you end up with is very small. I prefer to think of RegEx as more of an 'Arcane Art': it's powerful, it makes no sense when you see it for the first (or hundredth) time, and how you use it is totally up to you. I love it though, and I figure I can't be the only one, so I want to take a second to discuss how the new version of the macro is 'figuring out' which exchange each stock is traded on, and to also show that it isn't nearly as scary as one might think.

 

The basic process is as follows:

 

  • Download the page source for each stock's home page on Google Finance
  • Yank out the chunk of HTML that actually has the relevant information
  • Continue to pare down the chunk of code until you're left only with the data you're interested in

 

So, first off, we download the page source. Here's an excerpt from the page source we get when we search Google Finance for 'TWTR':

 

<!DOCTYPE html><html><head><script>(function(){(function(){function e(a){this.t={};this.tick=function(a,c,b){var d=void 0!=b?b:(new Date).getTime();this.t[a]=[d,c];if(void 0==b)try{window.console.timeStamp("CSI/"+a)}catch(e){}};this.tick("start",null,a)}var a;window.performance&&(a=window.performance.timing);var f=a?new e(a.responseStart):new e;window.jstiming={Timer:e,load:f};if(a){var c=a.navigationStart,d=a.responseStart;0<c&&d>=c&&(window.jstiming.srt=d-c)}if(a){var b=window.jstiming.load;0<c&&d>=c&&(b.tick("_wtsrt",void 0,c),b.tick("wtsrt_",
"_wtsrt",d),b.tick("tbsd_","wtsrt_"))}try{a=null,window.chrome&&window.chrome.csi&&(a=Math.floor(window.chrome.csi().pageT),b&&0<c&&(b.tick("_tbnd",void 0,window.chrome.csi().startE),b.tick("tbnd_","_tbnd",c))),null==a&&window.gtbExternal&&(a=window.gtbExternal.pageT()),null==a&&window.external&&(a=window.external.pageT,b&&0<c&&(b.tick("_tbnd",void 0,window.external.startE),b.tick("tbnd_","_tbnd",c))),a&&(window.jstiming.pt=a)}catch(g){}})();})();
</script><title>Twitter Inc: NYSE:TWTR quotes & news - Google Finance</title><meta name="Description" content="Get detailed financial information on Twitter Inc (NYSE:TWTR) including real-time stock quotes, historical charts & financial news, all for free!"><meta http-equiv="X-UA-Compatible" content="IE=10"><link rel="canonical" href="http://www.google.com/finance?cid=32086821185414"><link rel="stylesheet" type="text/css" href="/finance/f/finance_us-3578168175.css"><link rel="stylesheet" type="text/css" href="/finance/_/ss/a/ver=1rfoblf25stm8/am=!gly3Zx38QovgNqhW/bf=BA/r=O?k=1"><link rel="icon" type="image/vnd.microsoft.icon" href="/finance/favicon.ico"><style>@-webkit-keyframes gb__a{0%{opacity:0}50%{opacity:1}}@keyframes gb__a{0%{opacity:0}50%

 

There are many ways to go about this process, but I like to do this kind of thing in steps, so I start by finding the data I'm interested in (in this case, 'NYSE:TWTR') and then looking for something that uniquely identifies where this data is going to be. In simpler terms, we're looking for a landmark that will ALWAYS point to where the exchange:ticker combination is. In this case, that landmark is the <title> tag:

 

<title>Twitter Inc: NYSE:TWTR quotes & news - Google Finance</title>

This tag is valuable for two reasons:

  1. It's the title, so we can reasonably assume that this same tag will be in every other Google Finance stock page (we are looking for patterns, after all).
  2. The data contained within the title tag is short and sweet, which means we won't have to do much more once we pull it out of the HTML.

 

I use the following expression to pull the title tag (and its contents) out of the page source for each stock page:

 

(?<=<title>)(.*?)(?=<\/title>)

Which yields (in this example) a field that contains the following:

 

Twitter Inc: NYSE:TWTR quotes & news - Google Finance

Clearly, this is a significant improvement over the original HTML. At this point, all we need to do is strip away all the fluff in the title so that all we're left with is 'NYSE:TWTR'. To do this part, I use the Formula tool instead of the RegEx Parse tool, but this is really just a matter of personal preference. You could do it with another RegEx Parse tool if you'd like. Here's the formula I used:

 

REGEX_REPLACE([Stock_Names],'.*\:\s(.*?\:.*?)\s.*','$1') 

The short explanation of this is that the formula is basically describing what the entire string will look like while also indicating which part you're interested in (the part in parentheses, i.e. (.*?\:.*?)), and then saying "Replace this entire field with just the part I'm looking for". In the end, we're left with exactly what we wanted, 'NYSE:TWTR'.
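If it helps to see both parsing steps together outside of Alteryx, here's a rough Python equivalent. Only the two regular expressions are taken verbatim from the tools above; the HTML is abbreviated down to the title tag for the sake of the example:

import re

# Abbreviated page source for the TWTR example from above
html = '<title>Twitter Inc: NYSE:TWTR quotes & news - Google Finance</title>'

# Step 1: pull the contents of the <title> tag (the RegEx Parse step)
title = re.search(r'(?<=<title>)(.*?)(?=<\/title>)', html).group(1)
# title is now 'Twitter Inc: NYSE:TWTR quotes & news - Google Finance'

# Step 2: strip away the fluff, keeping only the exchange:ticker pair
# (the REGEX_REPLACE step from the Formula tool)
stock_name = re.sub(r'.*\:\s(.*?\:.*?)\s.*', r'\1', title)

print(stock_name)   # NYSE:TWTR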

 

Here's a snapshot of what that piece of the workflow looks like:

HTML Parsing

Final Thoughts

 

To be honest, my favorite part about this is just how dynamic and self-sufficient the whole process is. Using Alteryx's many tools (and a little bit of RegEx), we can start with VERY minimal input from the user and let the Alteryx process itself not only do all the downloading and storing, but also go and find whatever missing information it needs in order to actually initiate those downloads. To put it more simply, any time you can get Alteryx to do (almost) all the work for you, life is good.

 

I've attached the final version of the module and this time I've left the Quandl Upload function in (although it's still disabled by default within each macro, so you'll need to enable it and add a macro output if you want to mess around with this part of it). It's important to keep in mind that in order to run this module with the Quandl functionality you will need to set up a Quandl account and install the Quandl Toolbelt application. If you're interested in that, more information can be found here. Otherwise, you can just run it as is and you'll be able to see how everything prior to the Quandl part works.

 

It's my hope that this blog series will serve to (at least somewhat) demystify the idea of web scraping in general, and also show just how effective Alteryx can be at tackling a problem like this. In the future, I may add a few more parts to this series and discuss what we can do with all that data once it's stored in the cloud, so keep an eye out.

 

Thanks for reading.

 

-Chris
