Parsing HTML with Python Tool
I have read many articles about HTML parsing that say NOT to use regex, which is how I am doing it now, with a high level of accuracy but not 100%. Has anyone used a Python HTML parsing package within the Python Tool? I am parsing many HTML/CLOB fields with regex, but I am looking for a better way. Thank you.
- Labels:
- Best Practices
- Developer
- Help
- Python
While you wait for someone more knowledgeable than me to reply, I'll suggest the Python library Beautiful Soup. Without knowing more about your specific needs and use case, I can't say for sure whether it's the right solution for you, but it's a solid HTML parsing tool.
Excellent idea. How would I use it in the Python Tool? I am looking for an example of how to code the Python Tool using such a package.
Beautiful Soup is one of the most common Python packages used for web scraping. There are a few other options, including lxml. It is usually a matter of finding the package that best fits your needs.
I've attached a screenshot from a very simple example I have. The Python tool uses Jupyter Notebook, so no extra modifications are needed unless you are reading/writing data from Alteryx Designer.
Below is some useful documentation:
Beautiful Soup:
https://codeburst.io/web-scraping-101-with-python-beautiful-soup-bb617be1f486
lxml:
https://docs.python-guide.org/scenarios/scrape/
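For anyone who cannot see the screenshot, here is a minimal sketch of what a simple Beautiful Soup cell might look like. It is not the code from the screenshot. The sample HTML, the `summarize` helper, and its field names are made up for illustration, and the guarded import is only there so the cell does not crash if bs4 is not installed yet.

```python
try:
    from bs4 import BeautifulSoup
except ImportError:
    BeautifulSoup = None  # bs4 is not installed in this environment yet

# Made-up sample document, for illustration only
SAMPLE_HTML = """
<html><body>
  <h1>Quarterly report</h1>
  <p>Revenue grew <b>12%</b>.</p>
  <a href="https://example.com/details">Details</a>
</body></html>
"""

def summarize(html):
    """Return the first heading, the paragraph texts, and all link targets."""
    if BeautifulSoup is None:
        raise ImportError("bs4 is required; install beautifulsoup4 first")
    # "html.parser" is the parser bundled with Python, so no extra install
    soup = BeautifulSoup(html, "html.parser")
    heading = soup.find("h1")
    return {
        "heading": heading.get_text(strip=True) if heading else None,
        "paragraphs": [p.get_text() for p in soup.find_all("p")],
        "links": [a["href"] for a in soup.find_all("a", href=True)],
    }

if BeautifulSoup is not None:
    print(summarize(SAMPLE_HTML))
```

`find` returns the first matching tag (or `None`), `find_all` returns every match, and `get_text` drops the markup and keeps only the text, which is usually what a regex approach is trying to approximate.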
Thank you. I understand what I need to do, but I am running into some issues.
I go to import, either bs4 or BeautifulSoup, and I get a ModuleNotFoundError.
To complicate things, I am using an Anaconda instance (which has BeautifulSoup). Does Alteryx need to know where the Anaconda instance is?
Thank you
The Python Tool in Designer uses its own version of Miniconda, which may not have BeautifulSoup installed by default. You will need to install the necessary packages.
Use the following syntax in the Python Tool to install the package:
from ayx import Package
Package.installPackages(['beautifulsoup4'])
After this, the following should work:
from bs4 import BeautifulSoup
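If it helps, the install step can be made safe to re-run by checking for the module first. This is only a sketch: `is_installed` and `ensure_bs4` are hypothetical helper names, and the fallback message assumes the cell might also be run outside Designer, where the `ayx` package is not available.

```python
import importlib.util

def is_installed(module_name):
    """True if a top-level module can be located, without importing it."""
    return importlib.util.find_spec(module_name) is not None

def ensure_bs4():
    """Install beautifulsoup4 through the Designer helper only if bs4 is missing."""
    if is_installed("bs4"):
        return "bs4 already installed"
    try:
        from ayx import Package  # only importable inside the Designer Python tool
    except ImportError:
        return "ayx not available here; install beautifulsoup4 with pip or conda"
    Package.installPackages(["beautifulsoup4"])
    return "beautifulsoup4 installed"

if __name__ == "__main__":
    print(ensure_bs4())
```

Note that the package is installed as `beautifulsoup4` but imported as `bs4`, which is why `import BeautifulSoup` raises a ModuleNotFoundError even after a successful install.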
Thank you. I need to get admin privileges on my machine to do that but I can get everything else to work in command line and Anaconda. Now I just need to get it inside Alteryx.
Finally, everything is set up. Now my question is: how do I apply this to one column of a database table? For some reason, the HTML is stored in a column of an Oracle table alongside other columns, and of course there is no column holding just the clean text extracted from the HTML.
Thank you for your assistance. I am hoping this will finally get me over the hump.
You'll need to set up an ODBC driver in Alteryx to connect to Oracle and pull the field containing the HTML.
https://help.alteryx.com/2018.2/DataSources/Oracle.htm
You can then read this into the Python Tool and use the libraries to parse the HTML.
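As a rough sketch of that last step, assuming the Oracle data arrives on the Python Tool's first input anchor and that the HTML lives in a column called `HTML_BODY` (a made-up name, as is `CLEAN_TEXT`), the parsing could be applied one row at a time like this. `html_to_text` and `add_text_column` are hypothetical helpers, not part of any library.

```python
import math

try:
    from bs4 import BeautifulSoup
except ImportError:
    BeautifulSoup = None  # install beautifulsoup4 before parsing real data

def html_to_text(value):
    """Return the visible text of one HTML value; empty string for nulls."""
    # Null CLOBs usually show up in pandas as None or NaN
    if value is None or (isinstance(value, float) and math.isnan(value)):
        return ""
    if BeautifulSoup is None:
        raise ImportError("bs4 is required to parse non-null values")
    soup = BeautifulSoup(str(value), "html.parser")
    # Join text fragments with single spaces and trim surrounding whitespace
    return soup.get_text(separator=" ", strip=True)

def add_text_column(df, html_col, text_col):
    """Add text_col to a pandas DataFrame by parsing html_col row by row."""
    df[text_col] = df[html_col].apply(html_to_text)
    return df

# Inside the Designer Python Tool (input anchor "#1", output anchor 1):
# from ayx import Alteryx
# df = Alteryx.read("#1")
# df = add_text_column(df, "HTML_BODY", "CLEAN_TEXT")  # hypothetical column names
# Alteryx.write(df, 1)
```

The other columns pass through untouched, so the output is the original table plus one clean-text column.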