Engine Works

Under the hood of Alteryx: tips, tricks and how-tos.
DavidHa
Alteryx
Alteryx

Will It Alteryx Blog-Web Scraping.png

 

And we're back with another edition of the suspenseful series "Will it Alteryx?" Although web scraping is not a new concept with Alteryx, the method shown here is a more efficient and flexible approach. First, for those that aren't familiar with the concept of web scraping, the goal is to extract data from websites to be used for analysis. The data to be extracted is typically in a structured format, most commonly an HTML table. 

 

Many Community posts talk about how to accomplish web scraping with a combination of the Download tool and the RegEx tool. If this approach works for you then by all means keep using this method. However, recently @DavidM produced an article that changed the game, enter the Python Tool. Thanks to the huge support behind Python, the opportunities for what we can use it for are pretty much endless. In @DavidM's article he showed how to use the Python tool with a package called Selenium. Selenium is a project for automating web browser actions, including clicking, typing, waiting, etc.

 

Another option I suggest you consider is BeautifulSoup, which is a Python library specifically for parsing HTML and XML pages. This makes it a great choice for the web scraping ninjas out there!

 

beautiful.png

 

 

Will it Alteryx?

 

Starting your web scraping journey with BeautifulSoup and the Python tool takes no time at all. 

 

First, identify a table on a webpage that you'd like to pull into Designer for analysis. Since fantasy football draft season is upon us, and the running back is the most important position of all (sssh, don't tell anyone else that), I'll be analyzing last year's running back stats to try and understand who I might want to draft this season.  

 

With the web page open, use your browser's web development tool to view the HTML source code. On most browsers, this is available by hitting F12. From here, you can simply navigate through the source code watching the blue highlighting on the left as a guide until you find the code supporting the table in question.

 

chrome - developer tools.png

 

The source providing the table in our example is below. We'll need this in a bit...

 

<table cellpadding="0" border="0" id="data" class="table table-bordered table-striped table-hover tablesorter"> ... </table>

 

Next, simply drag the Python tool (in the Developer category) on the canvas and add the following code to the Python code editor.  

 

# Import packages
from ayx import Package
from ayx import Alteryx

# Note these two lines only need to be executed one time to install the package.
Package.installPackages(['bs4'])       
Package.installPackages(['html5lib'])  

import pandas
import requests
from bs4 import BeautifulSoup

# URL of whatever web page you want to scrape
url = "https://www.fantasypros.com/nfl/stats/rb.php"

page = requests.get(url)
soup = BeautifulSoup(page.text, 'html')

 

Note, the Package.installPackages lines only need to be executed once and require that you launch Designer by right-clicking and selecting Run as Administrator. Run this cell using the Python Run button circled below.  

 

run button.png

 

Assuming no errors, you will now have an object of type BeautifulSoup which contains the HTML page ready for parsing. From the source code we identified earlier, we're looking for an HTML table object with an id of data. BeautifulSoup makes extracting that information a one-line call:

 

# Find a specific table that includes an id attribute
table = soup.find('table', id="data")

 

Some HTML tables have no id attribute. In that case, you can simply find all the tables on the page and then specify which one based on its order of appearance on the page:

 

# for tables without an 'id' attribute, you can simply find all the tables and then grab the one you want. Remember the 1st one will be position '0'.
table = soup.find_all('table')[0]

 

Now that we have the desired table, we need to convert it to a Pandas DataFrame, which is the required format to output from the Python tool to use in an Alteryx workflow. Most solutions you'll find out there for converting an HTML table into a Pandas DataFrame show that you must hardcode the column names, which isn't a very scalable solution. What we want is something that is repeatable regardless of the data being ingested in our workflow. Hello pandas.read_html().  

 

df = pandas.read_html(table.prettify(), header=1, flavor='bs4')[0]
df.head()    

 

A few of points on the read_html() function call above:

  1. The read_html() function returns a list of dataframes, but in our case, there's only one, so we simply grab the first one off the list.  
  2. The prettify() method returns the string representation of the table. We could also use str(table) and accomplish the same thing.
  3. We must specify to use bs4 as the flavor since our table was parsed using BeautifulSoup. Mmm, flavor, soup, I'm getting hungry. 
  4. The 'header=1' option specifies to use the 1st row from the table as the column headers in our DataFrame. I had to specify this in my example since the table had an extra column header (Rushing, Receiving, Misc) above the real column headers:

html_table.PNG

 

If all that processed correctly, our table will be converted to a DataFrame, and we can use the df.head() function to output the first 5 rows for verification:

 

dataframe.PNG

 

With my table now converted to a DataFrame, I can send it to the 1st output anchor on the Python tool with the code below:

 

Alteryx.write(df,1)

 

At this point, we are free to start analyzing the data in question to solve our problem. In my case, the very important problem of understanding what statistics have the biggest impact on fantasy points for running backs.  

workflow.PNG

Final Thoughts

 

The combination of the Python tool with the BeautifulSoup package makes scraping data from the web attainable for everyone and provides for a robust, scalable, repeatable solution. If you've tried this out and found success, problems, or a better approach, please leave a comment and let us know!

 

If you have any technologies you would like to see explored in future installments of the "Will it Alteryx?" series, please leave a comment below!  

 

And lastly, for those wondering, the data shows the highest correlation to fantasy points for running backs is from rushing yards, number of rush attempts, and rush TDs.  Yards per carry is less influential, so go for volume!  

 

running-back-analysis.PNG

 

Now if you'll excuse me, I need to go clear off some space on my bookshelf...

 

trophy.png

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

References

 

David Hare
Senior Manager, Solutions Architecture

David has the privilege to lead the Alteryx Solutions Architecture team helping customers understand the Alteryx platform, how it integrates with their existing IT infrastructure and technology stack, and how Alteryx can provide high performance and advanced analytics. He's passionate about learning new technologies and recognizing how they can be leveraged to solve organizations' business problems.

David has the privilege to lead the Alteryx Solutions Architecture team helping customers understand the Alteryx platform, how it integrates with their existing IT infrastructure and technology stack, and how Alteryx can provide high performance and advanced analytics. He's passionate about learning new technologies and recognizing how they can be leveraged to solve organizations' business problems.

Comments
DavidM
Alteryx
Alteryx

Great article, @DavidHa! Love it

DIFM
5 - Atom

Is there a way to scrape data if one must first input a number into a text field to determine the data to be gleaned?  Also is there any guidance on pagination?

How one might navigate multiple instances of pages containing the data?

 

mc_wallendjack
8 - Asteroid

Great article @DavidHa . Very thoroughly explained!

Joker_Hazard
11 - Bolide

@DIFM Hey. I am having the same issue, have you discovered how to navigate? Thanks