

HTML parsing help

JamesGray
7 - Meteor

Hi,

 

I have limited parsing and RegEx experience, and I'm struggling with the volume of HTML on each page of the site, so I would really appreciate help from someone with more experience.

 

 

I am looking to parse out the data from the following gov website links: https://www.applytosupply.digitalmarketplace.service.gov.uk/g-cloud/search

 

I need the data from every search result in the link above. Each result page has a section called "Reseller"; for an example, see about two-thirds of the way down this page: Analytics and Data Science Service

 

I would like to parse this out into a column with the header "Supplier Type", containing the text shown in the corresponding field next to it on the web page.

 

Another example: FourNet (4net) Cloud Unified Communications (UCaaS). This page has a subfield underneath Reseller called "Organisation whose services are being resold". Where this field exists, I would also like to parse its data out into another column.

 

Thank you for any help.

 

OllieClarke
15 - Aurora

Hi @JamesGray 

 

Based on your post I've made this:

OllieClarke_0-1678880573358.png

 

The first RegEx isolates just the Reseller table (grabbing everything after the Reseller scroll tracking, but before the next scroll tracking).

OllieClarke_1-1678880593974.png
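
For anyone following along outside Designer, here is a rough Python sketch of the same "isolate one section" idea. The sample HTML and the `data-scroll-tracking` attribute name are assumptions made up for illustration; check the real page source and adjust the pattern to whatever marker it actually uses.

```python
import re

# Hypothetical markup: the attribute name and headings are assumptions,
# not copied from the real Digital Marketplace pages.
SAMPLE_HTML = """
<h2 data-scroll-tracking="Service scope">Service scope</h2>
<dl><dt>Service constraints</dt><dd>None</dd></dl>
<h2 data-scroll-tracking="Reseller">Reseller</h2>
<dl>
  <dt>Supplier type</dt><dd>Reseller providing extra features and support</dd>
  <dt>Organisation whose services are being resold</dt><dd>Example Org</dd>
</dl>
<h2 data-scroll-tracking="Staff security">Staff security</h2>
<dl><dt>Staff security clearance</dt><dd>Other</dd></dl>
"""


def isolate_section(html, name):
    """Return the text between the named marker and the next marker (or end of page)."""
    pattern = re.compile(
        r'data-scroll-tracking="' + re.escape(name) + r'"'
        r"(.*?)"                             # lazy: stop at the first following marker
        r"(?=data-scroll-tracking=|\Z)",     # lookahead leaves the next marker unconsumed
        re.DOTALL,                           # let "." span line breaks
    )
    match = pattern.search(html)
    return match.group(1) if match else None


reseller_block = isolate_section(SAMPLE_HTML, "Reseller")
```

The lazy `(.*?)` plus the lookahead is what keeps the match from running on into the following sections.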

The next RegEx takes this isolated table and tokenises the information in it: basically anything immediately before the </dt or </dd closing HTML tags.

OllieClarke_2-1678880814532.png
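
A comparable tokenising step could look like this in Python. It is only a sketch: the sample block is invented, and the pattern assumes the text sits directly inside the dt/dd element with no nested tags, which may not hold on the real pages.

```python
import re

# Hypothetical isolated Reseller block (made up for illustration).
RESELLER_BLOCK = """
<dl>
  <dt>Supplier type</dt>
  <dd>
    Reseller providing extra features and support
  </dd>
  <dt>Organisation whose services are being resold</dt>
  <dd>Example Org</dd>
</dl>
"""

# Any run of non-tag text immediately followed by a </dt or </dd closing tag.
# The closing tag is captured too, so it can be used to classify the token.
TOKEN = re.compile(r"([^<>]+)(</d[td])")


def tokenise(block):
    """Return (kind, text) pairs, where kind is 'header' or 'value'."""
    tokens = []
    for text, closing_tag in TOKEN.findall(block):
        cleaned = " ".join(text.split())  # collapse line breaks and indentation
        if cleaned:                       # skip whitespace-only matches
            kind = "header" if closing_tag == "</dt" else "value"
            tokens.append((kind, cleaned))
    return tokens


tokens = tokenise(RESELLER_BLOCK)
```

Capturing the closing tag alongside the text is what makes the later header/value split possible.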

I keep those closing tags in the output so we can use them to tell what's a header (</dt) from what's a value (</dd). We do a bit of cleaning, then create a record ID within each type and URL, and finally transform the data into the structure you want.

OllieClarke_3-1678880904458.png
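
The record ID and transform steps can be sketched in Python as well. The URLs and rows below are invented sample data, and the dictionary-based pivot is just one way to mimic what the workflow does, not a description of the tools used in it.

```python
from collections import defaultdict

# Hypothetical tokenised rows: (url, kind, text). All values are made up.
rows = [
    ("https://example.test/service-a", "header", "Supplier type"),
    ("https://example.test/service-a", "value", "Not a reseller"),
    ("https://example.test/service-b", "header", "Supplier type"),
    ("https://example.test/service-b", "value", "Reseller providing extra features and support"),
    ("https://example.test/service-b", "header", "Organisation whose services are being resold"),
    ("https://example.test/service-b", "value", "Example Org"),
]


def pivot(rows):
    """Pair the nth header with the nth value per URL, giving one record per URL."""
    counters = defaultdict(int)  # running record ID per (url, kind)
    cells = defaultdict(dict)    # (url, record_id) -> {"header": ..., "value": ...}
    for url, kind, text in rows:
        counters[(url, kind)] += 1
        cells[(url, counters[(url, kind)])][kind] = text

    table = defaultdict(dict)    # url -> {header: value}
    for (url, _record_id), pair in cells.items():
        if "header" in pair:
            table[url][pair["header"]] = pair.get("value")
    return dict(table)


table = pivot(rows)
```

Pages without the "Organisation whose services are being resold" field simply end up without that key, which would correspond to a null in that column.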

 

Hope that helps,

 

Ollie

 

 

 

 
