I have read many articles about HTML parsing and why NOT to use regex for it. Regex is how I am doing it now, and it gives me a high level of accuracy, but not 100%. Has anyone used a Python HTML parsing package within the Python Tool? I am parsing many HTML/CLOB fields with regex, and I am looking for a better way. Thank you.
While you wait for someone more knowledgeable than myself to reply, I'll suggest the Python library Beautiful Soup. Without knowing more about your specific needs and use case, I can't say for sure whether it's the right solution for you, but it's a solid HTML parsing tool.
Excellent idea. How would I use it in the Python Tool? I am looking for an example on how to code the Python tool using such a package.
Beautiful Soup is the most common Python package used for web scraping. There are a few other options, including lxml. It is usually a matter of finding the package that best fits your needs.
I've attached a screenshot from a very simple example I have. The Python tool uses Jupyter Notebook, so no extra modifications are needed unless you are reading/writing data from Alteryx Designer.
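In case the screenshot does not come through, here is a minimal sketch of basic Beautiful Soup usage (not necessarily the same code as in the screenshot). The sample HTML and the variable names are made up for illustration, and it assumes bs4 is already installed in the environment the Python tool runs in.

```python
from bs4 import BeautifulSoup

# Sample markup invented for illustration only.
html = """
<html>
  <head><title>Order 1001</title></head>
  <body>
    <p class="note">Shipped to <b>Denver</b>.</p>
    <a href="https://example.com/track/1001">Track</a>
  </body>
</html>
"""

# "html.parser" is the parser from Python's standard library,
# so nothing beyond bs4 itself has to be installed.
soup = BeautifulSoup(html, "html.parser")

title = soup.title.string                        # text inside <title>
note = soup.find("p", class_="note").get_text()  # paragraph text with tags removed
links = [a["href"] for a in soup.find_all("a")]  # href of every <a> tag

print(title)
print(note)
print(links)
```

The same three calls (find, find_all, get_text) cover most simple extraction jobs that would otherwise be done with regex.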
Below is some useful documentation:
Beautiful Soup:
https://codeburst.io/web-scraping-101-with-python-beautiful-soup-bb617be1f486
lxml:
https://docs.python-guide.org/scenarios/scrape/
Thank you. I understand what I need to do, but I am running into some issues.
I go to import, either bs4 or BeautifulSoup, and I get a ModuleNotFoundError.
To complicate things, I am using an Anaconda instance (which has BeautifulSoup). Does Alteryx need to know where the Anaconda instance is?
Thank you
The Python Tool in Designer uses its own version of Miniconda, which may not have BeautifulSoup installed by default. You will need to install the necessary packages.
Use the following syntax in the Python Tool to install the packages:
from ayx import Package
Package.installPackages(['beautifulSoup4'])
After this, the following should work:
from bs4 import BeautifulSoup
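If the import still raises ModuleNotFoundError, it can help to check which interpreter the Python tool is actually running and whether bs4 is visible to it. The snippet below is a small diagnostic sketch using only the standard library; the helper name is_installed is made up for illustration.

```python
import importlib.util
import sys


def is_installed(module_name):
    """Return True if this interpreter can locate the given top-level module."""
    return importlib.util.find_spec(module_name) is not None


# Path of the Python interpreter running this cell. If it does not point
# at your Anaconda install, packages installed there will not be found.
print(sys.executable)

# True once bs4 is installed in this interpreter's environment.
print(is_installed("bs4"))
```

Running the same cell from your Anaconda prompt and from the Python tool should show two different interpreter paths, which would explain why a package can exist in one and be missing in the other.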
Thank you. I need to get admin privileges on my machine to do that but I can get everything else to work in command line and Anaconda. Now I just need to get it inside Alteryx.
Finally, everything is set up. Now my question is: how do I apply this against one column in a database table? For some reason, the HTML is stored in a column of an Oracle table alongside other columns. Of course, there is no column holding just the clean text from the HTML.
Thank you for your assistance. I am hoping this will finally get me over the hump.
You'll need to set up an ODBC driver in Alteryx to reach out to Oracle and grab your field with the HTML.
https://help.alteryx.com/2018.2/DataSources/Oracle.htm
You can then read this into the Python tool and use the libraries to parse the HTML.
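As a rough sketch of that last step, assuming the Oracle data arrives on the Python tool's first input connection as a pandas DataFrame: the column names HTML_BODY and HTML_TEXT and the helper names are hypothetical, so swap in your own. The ayx import sits inside a function because that package is only available inside Designer.

```python
from bs4 import BeautifulSoup


def html_to_text(value):
    """Return the visible text of an HTML string; pass non-strings (nulls) through."""
    if not isinstance(value, str):
        return value
    soup = BeautifulSoup(value, "html.parser")
    # separator=" " keeps text from adjacent tags from running together;
    # strip=True trims whitespace around each text fragment.
    return soup.get_text(separator=" ", strip=True)


def clean_html_column(df, column, new_column):
    """Add new_column to the DataFrame holding the plain text of column."""
    df[new_column] = df[column].apply(html_to_text)
    return df


def run_in_designer():
    """Read input #1, strip the HTML from one column, write to output anchor 1."""
    from ayx import Alteryx  # only importable inside the Alteryx Python tool

    df = Alteryx.read("#1")
    # HTML_BODY and HTML_TEXT are placeholder column names.
    df = clean_html_column(df, "HTML_BODY", "HTML_TEXT")
    Alteryx.write(df, 1)
```

Inside the Python tool you would call run_in_designer() in the last cell. One thing to watch: the space separator is inserted between every text fragment, so inline tags can leave a stray space before punctuation. If that matters for your data, try get_text() without a separator and compare the results.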