
Parsing HTML with Python Tool

Nosal25

I have read many articles about HTML parsing that say NOT to use regex, which is exactly how I am doing it. My regex approach has high, but not 100%, accuracy. Has anyone used a Python HTML parsing package within the Python Tool? I am parsing many HTML/CLOB fields with regex, but I am looking for a better way. Thank you.

Developer

Python

Best Practices

Help

Accepted answers

AndrewKramer

Beautiful Soup is the most common Python package used for web scraping. There are a few other options, including lxml. It is usually a matter of finding the package that best fits your needs.

I've attached a screenshot of a very simple example. The Python tool uses Jupyter Notebook, so no extra modifications are needed unless you are reading/writing data from Alteryx Designer.

Below is some useful documentation:

Beautiful Soup:

https://codeburst.io/web-scraping-101-with-python-beautiful-soup-bb617be1f486

lxml:

https://docs.python-guide.org/scenarios/scrape/

beautiful_soup.PNG
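
Since the screenshot may not come through for everyone, here is a minimal text sketch of the same idea. It is not the code from the screenshot: the HTML snippet, the field names, and the `extract_fields` helper are all made up for illustration, and it assumes `bs4` is importable in the Python tool's environment.

```python
from bs4 import BeautifulSoup

# Hypothetical HTML, standing in for one value of an HTML/CLOB field.
html = """
<div class="record">
  <h1>Order 1001</h1>
  <p class="status">Shipped</p>
  <a href="https://example.com/orders/1001">Details</a>
</div>
"""


def extract_fields(markup):
    """Pull a few fields out of one HTML string; missing tags become None."""
    # "html.parser" is the parser bundled with Python, so lxml is not required.
    soup = BeautifulSoup(markup, "html.parser")
    title = soup.find("h1")
    status = soup.find("p", class_="status")
    link = soup.find("a", href=True)
    return {
        "title": title.get_text(strip=True) if title else None,
        "status": status.get_text(strip=True) if status else None,
        "link": link["href"] if link else None,
    }


fields = extract_fields(html)
print(fields)
```

If the HTML arrives from the workflow, the usual pattern would be to read it with `Alteryx.read("#1")` after `from ayx import Alteryx`, apply a function like this to the relevant column, and send the result back with `Alteryx.write(df, 1)`. The connection name and output anchor here are assumptions about your workflow.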

All comments

JonnyR

While you wait for someone more knowledgeable than me to reply, I'll suggest the Python library Beautiful Soup. Without knowing more about your specific needs and use case, I can't say for sure whether it's the right solution for you, but it's a solid HTML parsing tool.

Nosal25

Excellent idea. How would I use it in the Python Tool? I am looking for an example of how to code the Python tool using such a package.

Nosal25

Thank you. I understand what I need to do, but I am running into some issues.

When I try to import either bs4 or BeautifulSoup, I get a ModuleNotFoundError.

To complicate things, I am using an Anaconda instance (which has BeautifulSoup). Does Alteryx need to know where the Anaconda instance is?

Thank you

rd916

---------------------------------------------------------------------------
ModuleNotFoundError                       Traceback (most recent call last)
<ipython-input-13-f33cf0cb8fd3> in <module>
----> 1 from bs4 import BeautifulSoup

ModuleNotFoundError: No module named 'bs4'

I am getting the error above. Is there anything I can do, or am I doing something wrong?
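
A possible way to narrow this down, offered as a sketch rather than a confirmed fix: the error usually means the interpreter running the notebook cannot see the package, and a separate Anaconda install would not be picked up automatically if the Python tool runs its own environment. The `missing_modules` helper below is a made-up name for illustration. It reports which interpreter is running and whether `bs4` is visible to it.

```python
import importlib.util
import sys


def missing_modules(names):
    """Return the top-level module names that this interpreter cannot import."""
    return [name for name in names if importlib.util.find_spec(name) is None]


# Shows which Python executable is actually running the notebook.
print("Interpreter:", sys.executable)
print("Missing:", missing_modules(["bs4"]))

# If bs4 is reported missing, installing it from inside the Python tool
# would look like the two lines below. They are left commented out because
# the ayx module only exists inside Alteryx Designer.
# from ayx import Package
# Package.installPackages(["beautifulsoup4"])
```

Note that the package is installed as `beautifulsoup4` but imported as `bs4`; a plain `import BeautifulSoup` belongs to the old version 3 and will fail even when version 4 is installed. Depending on how Designer was installed, the install step may need Designer to be run as administrator.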
