Alteryx Designer Discussions

Find answers, ask questions, and share expertise about Alteryx Designer.

Python Code Tool - Web Scraping Dynamic Websites Using Selenium


Hey everyone,

 

Let me start with the bottom line here. I like to be practical:

 

This article shows how to use the new Python Code tool to download pretty much any data you need from a website of your choice.

That works even if the content is dynamically generated (where the DOWNLOAD tool won't do), and you can use it wherever you need in your workflow.

As a by-product, it also demonstrates one of the ways the new Python Code tool gives you even more flexibility in Alteryx.

 

The workflow with all of this is attached to the post.

 


 

Cool. Now that the practical part of me got that bottom line thing out, let me try to share my thought process here (and a few links to great resources). Here goes:

 

Just recently, Alteryx announced release 2018.3. One of the amazing new features in that release is the Jupyter Notebook integration for Python.

I am not one to shy away from a bit of coding. I did SQL for a living for about 7 years (read: a couple of thousand lines a month and dreaming in code). I also picked up some HTML/CSS/JavaScript and Java here and there.

 

But Python got me intrigued. I started with it not long ago, and it still amazes me how easy it is to pick up and get things done.

The fact that Python is an open platform where you can easily grab thousands of super handy packages is just brilliant. It makes it easy to do super cool stuff.

 

Let me share one of the use cases for Python with Alteryx Designer that I went through recently with one of our customers.

 

Use case

Download pricing information for some bonds from a trading website. (Note: this is not the actual customer's ticker. I am using it as an example only.)

This use case generalizes well to downloading just about any data from any website.

 

Use case details

As an analyst in a bank, for instance, you typically need to pull some kind of trading information from a website several times a day.

So it only makes sense to have Alteryx Designer get that for you and automate the process, especially if you need that data as an input to your workflow.

 

How to do it

The typical first choice for a task like this is the DOWNLOAD tool pointed at the REST API of the website/source of your choice.

Yes, Google Finance or Yahoo Finance will have this option in place. But not every website out there exposes an API.

 

How to do it 2

These large websites typically do not quote information for less global financial products, say bonds or shares issued in only one country.

For that reason, you may be left with one or two websites that actually quote the information you need but offer no REST API.

In that case, you can typically use the DOWNLOAD tool in Alteryx with the GET method to download the HTML content of such a website, and combine it with the REGEX tool to parse the data you need out of that code.
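As a sketch of that static approach: the GET is the DOWNLOAD tool's job, and a regular expression then pulls the value out. In pure Python, the parsing half looks roughly like this (the HTML fragment and the "Closing Price" pattern below are made up purely for illustration):

```python
import re

# Hypothetical HTML as a plain GET request (the DOWNLOAD tool's job)
# might return it for a static, non-dynamic page.
html = """
<table>
  <tr><td class="label">Closing Price</td><td class="value">98.75</td></tr>
  <tr><td class="label">Currency</td><td class="value">EUR</td></tr>
</table>
"""

# The REGEX tool's job: capture the number that follows the
# "Closing Price" label.
match = re.search(r'Closing Price</td><td class="value">([\d.]+)', html)
closing_price = float(match.group(1)) if match else None
print(closing_price)  # 98.75
```

This only works when the value is already present in the downloaded HTML, which is exactly the assumption that breaks on dynamically generated pages.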

 

A wee bit of a problem

What if none of the above works, though? There is no REST API for the website of your choice.

And the website is dynamically generated, meaning it uses something like asynchronous JavaScript to build its content.

That means when you try to scrape the page using GET with the DOWNLOAD tool, you get some (but not all) of the code and a ton of placeholders.

 

Solution

Use the new Python Code tool within Alteryx Designer to do the web scraping with a Python package called Selenium: scrape the website and push the full HTML back into the workflow for downstream processing. Then, in your workflow, use something like the REGEX tool to parse/tokenize the data you need.

 

"This way you can easily get the data you need from the website of your choice (even if dynamically generated) and use it wherever you need in your workflow."

 

Why is it cool?

As I said, the DOWNLOAD tool in Designer comes with restrictions. If a website uses asynchronous JavaScript or generally generates its content dynamically (some kind of web app), the DOWNLOAD tool just won't do. You will typically get only part of the HTML code, full of placeholders or a lot of script. No data.

 

With the workflow I built, you use a dev-testing package called Selenium to open a webpage in a browser automatically, wait until the page is fully generated (based on the ID of some tag), and then scrape the full page.

 

What is Selenium?

Selenium automates browsers. That's it! What you do with that power is entirely up to you. Primarily, it is for automating web applications for testing purposes but is certainly not limited to just that. Boring web-based administration tasks can (and should!) be automated as well. More at https://www.seleniumhq.org/

 

Selenium is supported in Python. You can just download it as a package and use it in your code. It is really just waiting for you out there.

More at https://selenium-python.readthedocs.io/installation.html#introduction

 

The actual workflow

To make this work, you need to install two packages (urllib3, selenium).

 

Plus you need to install ChromeDriver to support Selenium: http://chromedriver.chromium.org/downloads.

The code is quite simple. First, install the urllib3 and selenium packages.

  

from ayx import Alteryx

# install the selenium and urllib3 packages
# you may need to run Designer with elevated (admin) privileges for this
Alteryx.installPackages("urllib3")
Alteryx.installPackages("selenium")

And now the actual code to do the web scrape.

Make sure you point to the ChromeDriver you downloaded before, and replace the URL I was using with the URL of your choice.

 

This will use Selenium to open the website, wait for a certain element to load (so all the dynamically generated content is there), and push the code to output 1 of the Python tool.

 

from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.common.exceptions import TimeoutException
import pandas

# Start the WebDriver and load the page
# Using ChromeDriver here; change the path to match your environment
wd = webdriver.Chrome("C:/Program Files/Alteryx/bin/Plugins/chromedriver")

# Enter the URL you want to scrape
wd.get("https://www.six-group.com/exchanges/bonds/security_info_en.html?id=DE000A19W2L5EUR4")

# For dynamically generated websites wait for a specific ID tag
# We are waiting for SIP_OV_ClosingPrice in this example
delay = 3 # seconds
try:
    myElem = WebDriverWait(wd, delay).until(EC.presence_of_element_located((By.ID, 'SIP_OV_ClosingPrice')))
    print("Page is ready!")
except TimeoutException:
    print("Loading took too much time!")

# And grab the page whole HTML source
html_page = wd.page_source

# Attempt to close chromium instance
wd.quit()

#Print the HTML page
print(html_page)

# Turn the variable with the HTML page into a pandas DataFrame
df = pandas.DataFrame({"html_page":[html_page]})

#Write the data frame to Alteryx workflow for downstream processing
Alteryx.write(df,1)

 

Note that you also need to change this piece. It says: delay the processing until an element on the page (identified by its ID) has fully loaded.

You will typically use something like the browser's developer console to find the ID of the tag you want.

Selenium allows you to identify elements in other ways too, for instance by class name, CSS selector, or XPath.

 

 myElem = WebDriverWait(wd, delay).until(EC.presence_of_element_located((By.ID, 'SIP_OV_ClosingPrice')))

Once you run the attached sample workflow, the Python tool will do its thing and hand you the results for further downstream processing in that workflow.

I am using the REGEX tool to identify a certain tag in that code and get the actual CLOSING VALUE of the ticker I wanted.
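In pure Python, that REGEX tool step would look roughly like this. The one-row frame below is a made-up stand-in for what the Python tool pushes to output 1; the real page source is of course much larger:

```python
import re
import pandas

# Made-up stand-in for the tool's output 1: a one-row frame whose
# html_page column holds the scraped page source.
df = pandas.DataFrame({"html_page": ['<td id="SIP_OV_ClosingPrice">101.25</td>']})

# The REGEX tool's job downstream: capture the text inside the
# element whose ID we waited on (SIP_OV_ClosingPrice).
pattern = r'id="SIP_OV_ClosingPrice">([^<]+)<'
df["closing_value"] = df["html_page"].str.extract(pattern, expand=False)
print(df["closing_value"][0])  # 101.25
```

Doing it with the REGEX tool keeps the parsing visible in the workflow, but if you prefer, the same extraction could live inside the Python tool itself.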

 

Closing note

This article demonstrated how to use the new Python Code tool to download information from a website in cases where the DOWNLOAD tool just won't do.

 

Please note that you should always do your research before scraping information from a website, so that you comply with its vendor's conditions of use.

 

David Matyas
Sales Engineer
Alteryx

Just a note: a colleague of mine (let's call him Shaan for the sake of keeping his identity anonymous) reported that he has been getting some exceptions from the Chrome driver. It looks permissions-related, I think.

 

I haven't had any luck so far making this work with Firefox/geckodriver.

 

But IE seems to work, even though it's painful. You can get the IE driver from https://www.seleniumhq.org/download/

Then follow everything on that page from the REQUIRED CONFIGURATION paragraph onwards, including pieces like changing the registry, setting the ZOOM level for IE, and similar.

It is quite funny, but you really need to do all those steps, otherwise it just won't work.

 

David Matyas
Sales Engineer
Alteryx

Great post @DavidM

 

If you do encounter issues, you can test it outside of Alteryx in Python/Jupyter. As David mentions, do not forget to install the urllib3 and selenium packages first.

 

from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.common.exceptions import TimeoutException
import pandas

# Start the WebDriver and load the page
# Using Chromium Driver here, need to change path to match your environment
wd = webdriver.Chrome(r'C:\CD\chromedriver')

# Enter URL you want to scrape
wd.get('https://www.six-group.com/exchanges/bonds/security_info_en.html?id=DE000A19W2L5EUR4')

# For dynamically generated websites wait for a specific ID tag
# We are waiting for SIP_OV_ClosingPrice in this example
delay = 3 # seconds
try:
    myElem = WebDriverWait(wd, delay).until(EC.presence_of_element_located((By.ID, "SIP_OV_ClosingPrice")))
    print('Page is ready!')
except TimeoutException:
    print('Loading took too much time!')

# And grab the page whole HTML source
html_page = wd.page_source

# Attempt to close chromium instance
wd.quit()

#Print the HTML page
print(html_page)

 

 

@DavidM Are you me? I'd never really coded in Python until Alteryx added a Python tool. My first workflow involved... web scraping using... Selenium.

 

The phrase is a bit overused, but great minds think alike!


Haha, Patrick. That is hilarious :-) I take your note on great minds as a big compliment! Thanks.

David Matyas
Sales Engineer
Alteryx