Alteryx Designer Desktop Discussions

Nosal25 · ‎01-14-2019

I have read many articles on HTML parsing that say NOT to use REGEX, which is how I am doing it now, with a high level of accuracy, but not 100%. Has anyone used a Python HTML parsing package within the Python Tool? I am parsing many fields of HTML/CLOB with REGEX, but I am looking for a better way. Thank you.

JonnyR · ‎01-14-2019

While you wait for someone more knowledgeable than myself to reply, I'll suggest the python library "beautiful soup". Without knowing more about your specific needs and use case I can't say for sure if it's the right solution for you but it's a solid html parsing tool.

Nosal25 · ‎01-14-2019

Excellent idea. How would I use it in the Python Tool? I am looking for an example on how to code the Python tool using such a package.

JonnyR · ‎01-14-2019

My company, unfortunately, only has Alteryx 11.7 installed so I can't create an example workbook for you (the python tool was only added in 2018.1/2). For how to install 3rd party libraries, see this. For help with the python/ beautiful soup code for parsing see this.

AndrewKramer · ‎01-16-2019

Beautiful Soup is one of the most commonly used Python packages for web scraping. There are a few other options, including lxml. It is usually a matter of finding the package that best fits your needs.

I've attached a screenshot from a very simple example I have. The Python tool uses Jupyter Notebook, so no extra modifications are needed unless you are reading/writing data from Alteryx Designer.
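For readers without the screenshot, here is a rough sketch of what a very simple Beautiful Soup cell might look like. It is not the code from the screenshot: the sample HTML and the helper name `html_to_text` are made up for illustration, and the import is guarded so the cell still loads if bs4 has not been installed yet.

```python
# Illustrative sketch only; the sample HTML and function name are hypothetical.
try:
    from bs4 import BeautifulSoup
except ImportError:  # bs4 is not installed in this interpreter
    BeautifulSoup = None


def html_to_text(html):
    """Return the visible text of an HTML fragment, or None if bs4 is unavailable."""
    if BeautifulSoup is None:
        return None
    # "html.parser" is the parser that ships with Python, so no extra install is needed.
    soup = BeautifulSoup(html, "html.parser")
    # Join the text nodes with single spaces and trim surrounding whitespace.
    return soup.get_text(separator=" ", strip=True)


sample = "<div><h1>Report</h1><p>Status: <b>open</b></p></div>"
print(html_to_text(sample))
```

Unlike a regex, the parser copes with nested tags and attributes without a hand-written pattern for every case.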

Below is some useful documentation:

Beautiful Soup:

https://codeburst.io/web-scraping-101-with-python-beautiful-soup-bb617be1f486

lxml:

https://docs.python-guide.org/scenarios/scrape/

Nosal25 · ‎01-23-2019

Thank you. I understand what I need to do, but I am running into some issues.

I go to import, either bs4 or BeautifulSoup, and I get a ModuleNotFoundError.

To complicate things, I am using an Anaconda instance (which has BeautifulSoup). Does Alteryx need to know where the Anaconda instance is?

Thank you

AndrewKramer · ‎01-24-2019

The Python Tool in Designer uses its own version of Miniconda, which may not have BeautifulSoup installed by default. You will need to install the necessary packages.

Use the following syntax in the Python Tool to install the packages:

from ayx import Package

Package.installPackages(['beautifulsoup4'])

After this, the following should work:

from bs4 import BeautifulSoup
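Since the Python Tool runs its own interpreter rather than a separate Anaconda install, it can help to confirm which interpreter a cell is using and whether it can see bs4 before installing anything. The snippet below is a minimal sketch using only the standard library; the helper name `is_installed` is an illustrative assumption.

```python
import importlib.util
import sys


def is_installed(module_name):
    """Return True if the current interpreter can find a top-level module."""
    return importlib.util.find_spec(module_name) is not None


# Shows which Python is running this cell (the tool's own, not your Anaconda one).
print(sys.executable)

if is_installed("bs4"):
    print("bs4 is available")
else:
    # Inside the Python tool, the install call would go here:
    # from ayx import Package
    # Package.installPackages(['beautifulsoup4'])
    print("bs4 is missing from this interpreter")
```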

Nosal25 · ‎01-29-2019

Thank you. I need to get admin privileges on my machine to do that but I can get everything else to work in command line and Anaconda. Now I just need to get it inside Alteryx.

Nosal25 · ‎03-07-2019

Finally, everything is set up. Now my question is: how do I apply this against one column in a database table? For some reason, the HTML is stored in an Oracle table alongside other columns. Of course, there is no column with just the clean text from the HTML.

Thank you for your assistance. I am hoping this will finally get me over the hump.

AndrewKramer · ‎03-07-2019

You'll need to set up an ODBC driver in Alteryx to reach out to Oracle and grab your field with the HTML.

https://help.alteryx.com/2018.2/DataSources/Oracle.htm

You can then read this into the Python tool and use the libraries to parse the html.
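As a rough sketch of that last step, assuming the Oracle data arrives on the Python tool's first input anchor: the column names `HTML_FIELD` and `CLEAN_TEXT` and both helper functions are hypothetical, and the `Alteryx.read` / `Alteryx.write` lines are left as comments because they only run inside Designer.

```python
# Sketch only: column names and helper names are hypothetical.
try:
    import pandas as pd
    from bs4 import BeautifulSoup
except ImportError:  # lets the cell load even if a package is missing
    pd = BeautifulSoup = None


def strip_html(value):
    """Return the plain text of one HTML cell; missing values become None."""
    if pd.isna(value):
        return None
    return BeautifulSoup(str(value), "html.parser").get_text(" ", strip=True)


def add_clean_column(df, html_col, text_col):
    """Return a copy of df with text_col holding the parsed text of html_col."""
    out = df.copy()
    out[text_col] = out[html_col].apply(strip_html)
    return out


# Inside the Python tool (not runnable outside Designer):
# from ayx import Alteryx
# df = Alteryx.read("#1")  # data from the first input anchor
# Alteryx.write(add_clean_column(df, "HTML_FIELD", "CLEAN_TEXT"), 1)

if pd is not None:
    # Small stand-in for the Oracle table, with one empty HTML cell.
    demo = pd.DataFrame({"ID": [1, 2],
                         "HTML_FIELD": ["<p>First <b>note</b></p>", None]})
    result = add_clean_column(demo, "HTML_FIELD", "CLEAN_TEXT")
    print(result)
```

The other columns pass through untouched, so the parsed text can be written back out next to the original HTML.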
