

HTML parsing help

JamesGray
7 - Meteor

Hi,

 

I have limited parsing/regex experience and am struggling with the volume of HTML on the site, so I would really appreciate help from someone with more experience.

 

 

I am looking to parse out the data from the following gov website links: https://www.applytosupply.digitalmarketplace.service.gov.uk/g-cloud/search

 

I need every search result from the link above. Each result page has a section called "Reseller"; for an example, see about two-thirds of the way down the page for Analytics and Data Science Service.

 

I would like to parse this out into a column with the header "Supplier Type", containing the text shown in the corresponding field next to it on the web page.

 

Another example, FourNet (4net) Cloud Unified Communications (UCaaS), shows a subfield underneath Reseller called "Organisation whose services are being resold". If this field exists, I would also like to parse its data out into another field.

 

Thank you for any help.

 

1 REPLY
OllieClarke
16 - Nebula

Hi @JamesGray 

 

Based on your post I've made this:

[Screenshot: OllieClarke_0-1678880573358.png]

 

The first RegEx isolates just the resellers table, grabbing everything after the Reseller section's scroll-tracking marker but before the next scroll-tracking marker.

[Screenshot: OllieClarke_1-1678880593974.png]
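
For anyone following along outside Alteryx, the isolation step can be sketched in Python. This is only an illustration of the idea, not the expression used in the workflow: the `data-scroll-marker` attribute, the sample HTML and the `isolate_section` helper are all made up, so check the real page source for the actual marker before adapting it.

```python
import re
from typing import Optional

# Made-up fragment for illustration; the live page's markup will differ.
SAMPLE_HTML = """
<h2 data-scroll-marker="Pricing">Pricing</h2>
<dl><dt>Price</dt><dd>Available on request</dd></dl>
<h2 data-scroll-marker="Reseller">Reseller</h2>
<dl>
  <dt>Supplier type</dt><dd>Reseller providing extra support</dd>
</dl>
<h2 data-scroll-marker="Staff security">Staff security</h2>
<dl><dt>Staff security clearance</dt><dd>Other</dd></dl>
"""

# Hypothetical attribute standing in for the site's real scroll-tracking marker.
MARKER = 'data-scroll-marker="'


def isolate_section(html: str, section: str = "Reseller") -> Optional[str]:
    """Return the text after the named marker, up to the next marker or end of input."""
    pattern = re.compile(
        re.escape(MARKER + section + '"')
        + r"(.*?)"                                # lazy: stop at the first following marker
        + r"(?=" + re.escape(MARKER) + r"|\Z)",   # lookahead leaves the next marker unconsumed
        re.DOTALL,                                # let '.' span line breaks
    )
    match = pattern.search(html)
    return match.group(1) if match else None


reseller_block = isolate_section(SAMPLE_HTML)
```

The lazy `.*?` plus the lookahead is what keeps the match from running on into later sections; pages without a Reseller section simply return `None`.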

The next RegEx takes this isolated table and tokenises out the information in it: basically anything immediately before the </dt or </dd closing HTML tags.

[Screenshot: OllieClarke_2-1678880814532.png]
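
A rough Python equivalent of the tokenise step, again as a sketch rather than the exact expression from the workflow. The sample markup and the `tokenise` helper are hypothetical, and the pattern assumes each `<dt>`/`<dd>` holds plain text with no nested tags.

```python
import re

# Made-up Reseller block for illustration; real markup will carry more attributes.
SECTION_HTML = """
<dl>
  <dt class="key">Supplier type</dt>
  <dd class="value">Reseller providing extra support</dd>
  <dt class="key">Organisation whose services are being resold</dt>
  <dd class="value">
    Example Ltd
  </dd>
</dl>
"""

# Capture the text sitting immediately before a </dt or </dd closing tag,
# and keep the closing tag itself so headers and values can be told apart.
TOKEN = re.compile(r">\s*([^<>]+?)\s*(</d[td])")


def tokenise(section_html):
    """Return (text, kind) pairs, where kind is 'header' for </dt and 'value' for </dd."""
    return [
        (text, "header" if closing_tag == "</dt" else "value")
        for text, closing_tag in TOKEN.findall(section_html)
    ]


tokens = tokenise(SECTION_HTML)
```

Stripping the surrounding whitespace inside the pattern covers part of the "bit of cleaning" mentioned below; anything like HTML entities would still need handling separately.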

I keep those closing tags in the output so we can use them to tell what's a header (</dt) from what's a value (</dd). We do a bit of cleaning, then create a record ID within each type and URL, and finally transform the data into the structure you want.

[Screenshot: OllieClarke_3-1678880904458.png]
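
The record ID and reshaping step might look something like this in Python. It is a sketch under assumptions: the `pivot_tokens` helper, the example URLs and the field values are invented for illustration, and it relies on the nth header and nth value for a URL belonging together.

```python
from collections import defaultdict


def pivot_tokens(rows):
    """Turn (url, kind, text) rows into one dict of header -> value per URL.

    kind is 'header' or 'value'. A running record ID is assigned within each
    (url, kind) group, so header n and value n for the same URL are paired.
    """
    counters = defaultdict(int)  # last record ID issued per (url, kind)
    cells = defaultdict(dict)    # (url, record_id) -> {'header': ..., 'value': ...}
    for url, kind, text in rows:
        counters[(url, kind)] += 1
        cells[(url, counters[(url, kind)])][kind] = text

    table = defaultdict(dict)
    for (url, _record_id), cell in cells.items():
        if "header" in cell:
            # A header with no matching value ends up as None.
            table[url][cell["header"]] = cell.get("value")
    return dict(table)


# Hypothetical input: two service pages, one with the optional subfield.
ROWS = [
    ("https://example.invalid/service/1", "header", "Supplier type"),
    ("https://example.invalid/service/1", "value", "Reseller providing extra support"),
    ("https://example.invalid/service/1", "header", "Organisation whose services are being resold"),
    ("https://example.invalid/service/1", "value", "Example Ltd"),
    ("https://example.invalid/service/2", "header", "Supplier type"),
    ("https://example.invalid/service/2", "value", "Not a reseller"),
]

result = pivot_tokens(ROWS)
```

Because the output is keyed by header text, a page that lacks "Organisation whose services are being resold" just has no entry for it, which maps naturally to an empty cell in the final column.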

 

Hope that helps,

 

Ollie

 

 

 

 
