Do you use Alteryx in a language other than English? If so, we want to hear from you! Please help us improve the international experience of our products by participating in this 5 minute survey.

We are updating the requirements for Community registration. As of 7/21/21 all users will be required to register a phone number with their My Alteryx accounts. If you have already registered, you will be prompted on your next login to add your phone number.

Alteryx Designer Discussions

Find answers, ask questions, and share expertise about Alteryx Designer.

Python Code Tool - Web Scraping Dynamic Websites Using Selenium

DavidM
Alteryx
Alteryx

Hey everyone,

 

Let me start with the bottom line here. I like to be practical:

 

This article is about using the new Python Code tool to help you download pretty much any data you need from a website of your choice.

Even if dynamically generated (DOWNLOAD tool won't do) and use it wherever you need in your workflow.

As a by-product, we are actually demonstrating one of the ways that the new Python Code tool is super powerful to give you even more flexibility in Alteryx.

 

Workflow with all this is attached to the post.

 

image.png

 

Cool. Now that the practical part of me got that bottom line thing out, let me try to share my thought process here (and a few links to great resources). Here goes:

 

Just recently Alteryx announced the release of 2018-3. One of the amazing new features as part of that release is the Jupyter Notebooks integration for Python.

I am not the one to shy away from a bit of coding. I did a bit of SQL for living for about 7 years (read as: a couple of thousand lines a month and dreaming about code). I also picked up some HTML/CSS/JavaScript and Java as well here and there.

 

But Python got me intrigued. I have started with it not a long time ago and it always amazes me how easy it is to pick up and do things.

The fact that the Python is an open platform and you can easily get millions of super handy packages is just brilliant. Makes it easy to do super cool stuff.

 

Let me share one of the use cases for Python to be used with Alteryx Designer that I went through recently with one of our customers.

 

Use case

Download pricing information of some bonds from a trading website. (Note: this is not the actual customer's ticker. Using this one as an example only).

This use case can be well generalized to downloading just about any data from any website.

 

Use case details

As an analyst in a bank, for instance, you typically need to access a website for some kind of trading information several times a day.

So it only makes sense to use Alteryx Designer to get that for you and automate that process. Especially if you need that input to work with in your workflow.

 

How to do it

The first choice for similar tasks is to typically use the DOWNLOAD tool with the REST API of the website/ source of your choice.

Yes, Google Finance or Yahoo Finance websites will have this option in place OK. But not every website out there expose their APIs.

 

How to do it 2

These large websites do not typically quote those non-world-wide financial products information. Say some bonds/shares that are issued only in one country or so.

For this reason, you may be left with one or two websites that actually quote the information you need but don't have the REST API.

In that case, you can typically use the DOWNLOAD tool in Alteryx with the GET method to download the HTML content of such website.
And typically combine this with some REGEX tool to parse the data you need from that code.

 

Wee bit of Problem

What if the above suggested does not work though? You don't have REST API for the website of your choice.

And the website is actually dynamically generated. That means using something like asynchronous JavaScript to generate the content.

This means that when you try to web scrape the webpage using GET with DOWNLOAD tool, it will get you some (but not all) code and a ton of placeholders.

 

Solution

Use the new Python Code tool withing Alteryx Designer to help you do the web scraping with Python package called Selenium to scrape a website and push it back (full HTML) into a workflow for the downstream processing. Then, in your workflow use something like REGEX tool to parse/ tokenize the data you need.

 

"This way you can easily get the data you need from the website of your choice (even if dynamically generated) and use it wherever you need in your workflow."

 

Why is it cool?

As I said - the Download tool in Designer comes with restrictions. If a web is using asynchronous methods of JavaScript or is generally Dynamically generating content (some kind of a web app), the download tool just won’t do. You will get just part of your HTML code typically with placeholders or a lot of script only. No data.

 

With the workflow I built, you use dev testing package called Selenium to open a webpage in a browser automatically, wait until the page is actually fully generated (based on an ID of some tag) and then get the full page scraped then.

 

What is Selenium?

Selenium automates browsers. That's it! What you do with that power is entirely up to you. Primarily, it is for automating web applications for testing purposes but is certainly not limited to just that. Boring web-based administration tasks can (and should!) be automated as well. More at https://www.seleniumhq.org/

 

Selenium is supported with Python. You can just download it as a package and use it in your code. Really just waiting for you out there.

More at https://selenium-python.readthedocs.io/installation.html#introduction

 

The actual workflow

To make this work, you need to install two packages (urllib3, selenium).

 

Plus you need to install the ChromeDriver to support Selenium http://chromedriver.chromium.org/downloads.

The code is quite simple. First, you need to install the urllib3 and selenium packages first.

  

from ayx import Alteryx

#install the selenium and urllib3 packages
#you may need to elevate privileges for running Designer to admin for doing this
Alteryx.installPackages("urllib3")
Alteryx.installPackages("selenium")

And now the actual code to do the web scrape. 

Make sure you point to that Chromium driver you downloaded before. Just as well replace the URL I was using to the URL of your choice.

 

This will then use Selenium to open that website, wait for a load of a certain element (get all dynamically generated content) and push the code to the number 1 output from the Python tool.

 

from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.common.exceptions import TimeoutException
import pandas

# Start the WebDriver and load the page
# Using Chromium Driver here, need to change path to match youe env
wd = webdriver.Chrome("C:/Program Files/Alteryx/bin/Plugins/chromedriver")

# Enter URL you want to scrapewd.get("https://www.six-group.com/exchanges/bonds/security_info_en.html?id=DE000A19W2L5EUR4")

# For dynamically generated websites wait for a specific ID tag
# We are waiting for SIP_OV_ClosingPrice in this example
delay = 3 # seconds
try:
    myElem = WebDriverWait(wd, delay).until(EC.presence_of_element_located((By.ID, 'SIP_OV_ClosingPrice')))
    print ("Page is ready!")
except TimeoutException:
    print ("Loading took too much time!")

# And grab the page whole HTML source
html_page = wd.page_source

# Attempt to close chromium instance
wd.quit()

#Print the HTML page
print(html_page)

#Turn the variabe with html page into Pandas' DF
df = pandas.DataFrame({"html_page":[html_page]})

#Write the data frame to Alteryx workflow for downstream processing
Alteryx.write(df,1)

 

Note that you also need to change that piece here. This says - delay the processing until an element in the page (by id of that element) is fully loaded.

You will typically want to use something like a developer console to get the ID of the tag.

Selenium allows you to identify elements in other ways too. Classes for instance. And others.

 

 myElem = WebDriverWait(wd, delay).until(EC.presence_of_element_located((By.ID, 'SIP_OV_ClosingPrice')))

Once you run the sample workflow attached, the Python tool will do its thing and get you the results for further downstream processing in that workflow.

I am actually using the REGEX tool to identify a certain tag in that code and get the actual CLOSING VALUE of that ticker I wanted.

 

Closing note

This article demonstrates how to utilize the new Python Code Tool to download information from a website in case the DOWNLOAD tool just won't do.

 

Please note that you should always do your research before scraping information from websites so you comply with the conditions of use of their vendors.

 

David Matyas
Sales Engineer
Alteryx
36 REPLIES 36
DavidM
Alteryx
Alteryx

Hi @ibesmond,

 

Seems like you were able to install the chromedriver correctly but the very last printscreen shows that you are experiencing some network issues.

 

I would suggest revisiting this with your IT and point them to what you are trying to do + point them to https://chromedriver.chromium.org/security-considerations

 

Another thing you can do is to research another driver - for instance Firefox/ geckodriver.

 

I would also be curious to see the full exception/error message in the Python tool alone (interactive mode) - as you have only shared a small portion of the exception there, so i don't see it whole.

 

David

David Matyas
Sales Engineer
Alteryx
ibesmond
8 - Asteroid

Hi @DavidM .

 

I came from an accounting background and haven't written HTML since myspace first came out.  20 years later I haven't done much coding.

 

I downloaded geckodriver and tried to use sublime text to run the python code.  I got these errors for both methods. Would it be helpful Re-run the code in pyCharm if that would help.  I do know what you mean by Python tool alone (interactive mode)? Can I run that?

 

sublime text.png

DavidM
Alteryx
Alteryx

Hi @ibesmond,

 

Can you please run the worklfow in Alteryx. And once done, click on the PYTHON CODE tool icon which is part of that worklfow, and on the left part of the screen scroll in the code until you find the errror message. Please if you can share that it would be great.

 

david

David Matyas
Sales Engineer
Alteryx
ibesmond
8 - Asteroid

Here is what I can see @DavidM 

 

---------------------------------------------------------------------------
SessionNotCreatedException                Traceback (most recent call last)
<ipython-input-2-fcccb8dfbe42> in <module>
      8 # Start the WebDriver and load the page
      9 # Using Chromium Driver here, need to change path to match youe env
---> 10 wd = webdriver.Chrome("C:/webdrivers/chromedriver")
     11 
     12 # Enter URL you want to scrape

c:\users\ibesmond\appdata\local\alteryx\bin\miniconda3\envs\jupytertool_venv\lib\site-packages\selenium\webdriver\chrome\webdriver.py in __init__(self, executable_path, port, options, service_args, desired_capabilities, service_log_path, chrome_options, keep_alive)
     79                     remote_server_addr=self.service.service_url,
     80                     keep_alive=keep_alive),
---> 81                 desired_capabilities=desired_capabilities)     82         except Exception:
     83             self.quit()

c:\users\ibesmond\appdata\local\alteryx\bin\miniconda3\envs\jupytertool_venv\lib\site-packages\selenium\webdriver\remote\webdriver.py in __init__(self, command_executor, desired_capabilities, browser_profile, proxy, keep_alive, file_detector, options)
    155             warnings.warn("Please use FirefoxOptions to set browser profile",
    156                           DeprecationWarning, stacklevel=2)
--> 157         self.start_session(capabilities, browser_profile)
    158         self._switch_to = SwitchTo(self)
    159         self._mobile = Mobile(self)

c:\users\ibesmond\appdata\local\alteryx\bin\miniconda3\envs\jupytertool_venv\lib\site-packages\selenium\webdriver\remote\webdriver.py in start_session(self, capabilities, browser_profile)
    250         parameters = {"capabilities": w3c_caps,
    251                       "desiredCapabilities": capabilities}
--> 252         response = self.execute(Command.NEW_SESSION, parameters)
    253         if 'sessionId' not in response:
    254             response = response['value']

c:\users\ibesmond\appdata\local\alteryx\bin\miniconda3\envs\jupytertool_venv\lib\site-packages\selenium\webdriver\remote\webdriver.py in execute(self, driver_command, params)
    319         response = self.command_executor.execute(driver_command, params)
    320         if response:
--> 321             self.error_handler.check_response(response)
    322             response['value'] = self._unwrap_value(
    323                 response.get('value', None))

c:\users\ibesmond\appdata\local\alteryx\bin\miniconda3\envs\jupytertool_venv\lib\site-packages\selenium\webdriver\remote\errorhandler.py in check_response(self, response)
    240                 alert_text = value['alert'].get('text')
    241             raise exception_class(message, screen, stacktrace, alert_text)
--> 242         raise exception_class(message, screen, stacktrace)
    243 
    244     def _value_or_default(self, obj, key, default):

SessionNotCreatedException: Message: session not created: This version of ChromeDriver only supports Chrome version 83


 

You need to install the ChromeDriver to support Seleniumhttp://chromedriver.chromium.org/downloads

Adjust the path to the driver in the webdriver portion above to your chromium driver

Change the URL you want to scrape in the code above

To be able to scrape the dynamically generated websites (takes some time to fully load) you need to specify an ID of a html tag to wait for. Here "SIP_OV_ClosingPrice" is used. You need to change this to match your use case.

DavidM
Alteryx
Alteryx

Hi @ibesmond ,

 

I think the part that says 

This version of ChromeDriver only supports Chrome version 83

is crucial here.

 

not sure whether its an old version of Chrome you may be using.

 

you may need to find version of Chrome in place and re-download proper version of Chromium Driver

 

http://chromedriver.chromium.org/downloads

David Matyas
Sales Engineer
Alteryx
ibesmond
8 - Asteroid

That did it.  @DavidM . I could have sworn I checked and Chrome was on 83. Downloaded Chromedriver 81. Boom.  Can't believe I mixed that up.  Thank you a hundred times over! 

PragyaChouksey
5 - Atom

Hi,

I'm trying to install Selenium via Alteryx Python tool using below code. However it gives the error. It seems like a permission issue. I'm not sure if I need to raise a request to install Selenium at the specified path in Alteryx folder (c:\program files\alteryx\bin\miniconda3\envs\jupytertool_venv\lib\site-packages). Can you please advise on this.

 

# List all non-standard packages to be imported by your
# script here (only missing packages will be installed)
from ayx import Package
Package.installPackages(['Selenium'])

 

ERROR -

 

Collecting Selenium
Using cached https://files.pythonhosted.org/packages/80/d6/4294f0b4bce4de0abf13e17190289f9d0613b0a44e5dd6a7f5ca98...
Requirement already satisfied: urllib3 in c:\program files\alteryx\bin\miniconda3\envs\jupytertool_venv\lib\site-packages (from Selenium) (1.25.2)
Installing collected packages: Selenium
ERROR: Could not install packages due to an EnvironmentError: [WinError 5] Access is denied: 'c:\\program files\\alteryx\\bin\\miniconda3\\envs\\jupytertool_venv\\Lib\\site-packages\\selenium'
Consider using the `--user` option or check the permissions.
---------------------------------------------------------------------------
CalledProcessError Traceback (most recent call last)
<ipython-input-5-57531c9622dd> in <module>
2 # script here (only missing packages will be installed)
3 from ayx import Package
----> 4 Package.installPackages(['Selenium'])

c:\program files\alteryx\bin\miniconda3\envs\jupytertool_venv\lib\site-packages\ayx\Package.py in installPackages(package, install_type, debug)
200 print(pip_install_result["msg"])
201 if not pip_install_result["success"]:
--> 202 raise pip_install_result["err"]

c:\program files\alteryx\bin\miniconda3\envs\jupytertool_venv\lib\site-packages\ayx\Utils.py in runSubprocess(args_list, debug)
118
119 try:
--> 120 result = subprocess.check_output(args_list, stderr=subprocess.STDOUT)
121 if debug:
122 print("[Subprocess success!]")

c:\program files\alteryx\bin\miniconda3\envs\jupytertool_venv\lib\subprocess.py in check_output(timeout, *popenargs, **kwargs)
354
355 return run(*popenargs, stdout=PIPE, timeout=timeout, check=True,
--> 356 **kwargs).stdout
357
358

c:\program files\alteryx\bin\miniconda3\envs\jupytertool_venv\lib\subprocess.py in run(input, timeout, check, *popenargs, **kwargs)
436 if check and retcode:
437 raise CalledProcessError(retcode, process.args,
--> 438 output=stdout, stderr=stderr)
439 return CompletedProcess(process.args, retcode, stdout, stderr)
440

CalledProcessError: Command '['c:\\program files\\alteryx\\bin\\miniconda3\\envs\\jupytertool_venv\\python.exe', '-m', 'pip', 'install', 'Selenium']' returned non-zero exit status 1.

DavidM
Alteryx
Alteryx

Hi @PragyaChouksey 

 

This looks like permissions issue. You may need to run Alteryx in admin mode to install the package.

 

Plan B is try to install it using CONDA

https://community.alteryx.com/t5/Alteryx-Designer-Discussions/Manage-packages-with-Conda-for-the-Pyt...

 

dm

David Matyas
Sales Engineer
Alteryx
sraheja21
5 - Atom

Hi Guys,

 

I need some help here. 

 

I have already established a connection with the website from where I need to download the data using the chromedriver and packaged like selenium. When I am running the code it is opening up a window in which there are multiple steps taking place to download an excel file but that is happening currently on my desktop. My main goal is to download the excel file in Alteryx so that i could use the same as input in my Alteryx workflow. 

 

Any help here would be much appreciated. 

 

Gist - My code is running perfectly fine in Jupyter in downloading the excel file from a website but the same code in Python tool in Alteryx is downloading the same excel file but on desktop rather i want that file to be considered as input to Alteryx. 

DavidM
Alteryx
Alteryx

@sraheja21 I would suggest you download the file to your file system and then pick it up with an INPUT DATA tool.

 

This will be much simpler than coding the excel parsing directly in Python Code tool of the scratch.

David Matyas
Sales Engineer
Alteryx
Labels