
SOLVED

Parsing HTML with Python Tool

Nosal25
8 - Asteroid

I have read many articles about HTML parsing that say NOT to use REGEX, which is how I am doing it now, with a high level of accuracy but not 100%. Has anyone used a Python HTML parsing package within the Python Tool? I am parsing many HTML/CLOB fields with REGEX, but I am looking for a better way. Thank you.

 

JonnyR
7 - Meteor

While you wait for someone more knowledgeable than me to reply, I'll suggest the Python library Beautiful Soup. Without knowing more about your specific needs and use case, I can't say for sure whether it's the right solution for you, but it's a solid HTML parsing tool.

Nosal25
8 - Asteroid

Excellent idea. How would I use it in the Python Tool? I am looking for an example on how to code the Python tool using such a package.

JonnyR
7 - Meteor

My company, unfortunately, only has Alteryx 11.7 installed, so I can't create an example workbook for you (the Python tool was only added in 2018.1/2). For how to install third-party libraries, see this. For help with the Python/Beautiful Soup code for parsing, see this.

AndrewKramer
Alteryx Alumni (Retired)

Beautiful Soup is the most common Python package used for web scraping. There are a few other options, including lxml. It is usually a matter of finding the package that best fits your needs.

 

I've attached a screenshot from a very simple example I have. The Python tool uses Jupyter Notebook, so no extra modifications are needed unless you are reading/writing data from Alteryx Designer.
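
A minimal sketch of what such a simple example could look like (this is not the original attachment; the sample HTML is made up for illustration, and it assumes the beautifulsoup4 package is already installed in the Python tool's environment):

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup, standing in for a real HTML field.
html = """
<html><body>
  <h1>Quarterly Report</h1>
  <p>Revenue grew <b>12%</b> year over year.</p>
  <a href="https://example.com/details">Details</a>
</body></html>
"""

# "html.parser" is Python's built-in parser, so no extra install is needed for it.
soup = BeautifulSoup(html, "html.parser")

# Text of the first <h1> element.
title = soup.h1.get_text(strip=True)

# href attribute of every <a> tag that has one.
links = [a["href"] for a in soup.find_all("a", href=True)]

# All visible text, with tags removed and pieces joined by single spaces.
text = soup.get_text(separator=" ", strip=True)

print(title)
print(links)
print(text)
```

Unlike a regular expression, the parser handles nested tags and attributes for you, so you ask for elements by name instead of matching raw markup.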

 

Below is some useful documentation:

Beautiful Soup:

https://codeburst.io/web-scraping-101-with-python-beautiful-soup-bb617be1f486

 

lxml:

https://docs.python-guide.org/scenarios/scrape/

 

Nosal25
8 - Asteroid

Thank you. I understand what I need to do, but I am running into some issues.

 

When I try to import either bs4 or BeautifulSoup, I get a ModuleNotFoundError.

 

To complicate things, I am using an Anaconda instance (which has BeautifulSoup). Does Alteryx need to know where the Anaconda instance is? 

 

Thank you

AndrewKramer
Alteryx Alumni (Retired)

The Python Tool in Designer uses its own version of Miniconda, which may not have BeautifulSoup installed by default. You will need to install the necessary packages.

 

Use the following syntax in the Python Tool to install the packages:

from ayx import Package

Package.installPackages(['beautifulsoup4'])

 

After this, the following should work:

from bs4 import BeautifulSoup
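
If the import still fails, it may help to check which interpreter the Python Tool is actually running and whether it can see bs4. The snippet below is only a rough diagnostic sketch using the standard library; `describe_environment` is a made-up helper name, not part of the ayx package:

```python
import importlib.util
import sys


def describe_environment(module_name="bs4"):
    """Return the interpreter path and whether module_name can be imported."""
    # find_spec returns None when a top-level module cannot be located.
    spec = importlib.util.find_spec(module_name)
    return {
        "interpreter": sys.executable,
        "module": module_name,
        "installed": spec is not None,
    }


info = describe_environment("bs4")
print(info)
```

If the interpreter path points at a separate Anaconda install rather than the one bundled with Designer, packages installed there will not be visible to the Python Tool.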

Nosal25
8 - Asteroid

Thank you.  I need to get admin privileges on my machine to do that but I can get everything else to work in command line and Anaconda.  Now I just need to get it inside Alteryx.  

 

Nosal25
8 - Asteroid

Finally, everything is set up. Now my question is: how do I apply this against one column in a database table? For some reason, the HTML is stored in an Oracle table alongside other columns. Of course, there is no column with just the clean text extracted from the HTML.

 

Thank you for your assistance. I am hoping this will finally get me over the hump.

AndrewKramer
Alteryx Alumni (Retired)

You'll need to set up an ODBC driver in Alteryx to connect to Oracle and retrieve the field containing the HTML.

 

https://help.alteryx.com/2018.2/DataSources/Oracle.htm

 

You can then read this into the Python tool and use the libraries to parse the HTML.
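
As a rough sketch of that last step, assuming the Oracle data arrives on the Python tool's first input and the HTML sits in a column called `HTML_BODY` (a hypothetical name, as are `CLEAN_TEXT` and the sample rows), something like the following could add a plain-text column. The sample DataFrame is used only when the code runs outside Designer, where the ayx package is not available:

```python
import pandas as pd
from bs4 import BeautifulSoup

try:
    from ayx import Alteryx  # only available inside the Designer Python Tool
except ImportError:
    Alteryx = None


def html_to_text(value):
    """Strip tags from one HTML string; missing values become empty text."""
    if value is None or pd.isna(value):
        return ""
    soup = BeautifulSoup(str(value), "html.parser")
    return soup.get_text(separator=" ", strip=True)


if Alteryx is not None:
    # Read the data connected to the first input anchor as a DataFrame.
    df = Alteryx.read("#1")
else:
    # Made-up rows for trying the code outside Designer.
    df = pd.DataFrame({
        "ID": [1, 2, 3],
        "HTML_BODY": [
            "<p>Hello <b>world</b></p>",
            "<div>Line one<br>Line two</div>",
            None,
        ],
    })

# HTML_BODY is an assumed column name; replace it with your actual field.
df["CLEAN_TEXT"] = df["HTML_BODY"].apply(html_to_text)

if Alteryx is not None:
    # Send the result to output anchor 1 of the Python tool.
    Alteryx.write(df, 1)
```

The other columns pass through untouched, so the cleaned text can be joined or written back alongside the original record downstream in the workflow.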
