Hi,
I have limited parsing / regex experience and am struggling with the volume of HTML on the site, so I would really appreciate help from someone with more experience.
I am looking to parse out the data from the following gov website links: https://www.applytosupply.digitalmarketplace.service.gov.uk/g-cloud/search
I need the data from each search result listed at the link above. Each result page has a section called "Reseller"; for an example, see about two-thirds of the way down the "Analytics and Data Science Service" page.
I would like to parse this out into a column with the header "Supplier Type", containing the text shown in the corresponding field on the web page.
Another example, "FourNet (4net) Cloud Unified Communications (UCaaS)", shows that there can be a subfield underneath Reseller called "Organisation whose services are being resold". If this field exists, I would also like to parse its data out into another column.
Thank you for any help.
Hi @JamesGray
Based on your post I've made this:
The first RegEx isolates just the Reseller table: it grabs everything after the Reseller scroll-tracking marker, but before the next scroll-tracking marker.
The next RegEx takes this isolated table and tokenises the information in it, which is basically any text immediately before a closing </dt> or </dd> HTML tag.
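For anyone who wants to try the same two-step idea outside the workflow, here is a minimal Python sketch. The sample HTML and the `data-scroll-tracking` attribute name are assumptions made for illustration and may not match the real page markup, so the first pattern would need adjusting against the actual source.

```python
import re

# Hypothetical page fragment; the real markup on the site may differ.
SAMPLE_HTML = """
<h2 data-scroll-tracking="Pricing">Pricing</h2>
<dl><dt>Price</dt><dd>500 a unit</dd></dl>
<h2 data-scroll-tracking="Reseller">Reseller</h2>
<dl>
  <dt>Supplier type</dt><dd>Reseller (no extras)</dd>
  <dt>Organisation whose services are being resold</dt><dd>Example Vendor Ltd</dd>
</dl>
<h2 data-scroll-tracking="Staff security">Staff security</h2>
<dl><dt>Staff security clearance</dt><dd>Other</dd></dl>
"""

# Step 1: everything after the Reseller marker, up to the next marker
# (or the end of the page if Reseller is the last section).
SECTION_RE = re.compile(
    r'data-scroll-tracking="Reseller"(.*?)(?=data-scroll-tracking=|\Z)',
    re.DOTALL,
)

# Step 2: the text immediately before a closing </dt> or </dd>,
# capturing the tag name so headers and values can be told apart later.
TOKEN_RE = re.compile(r'>\s*([^<>]+?)\s*</(dt|dd)>')


def reseller_tokens(html):
    """Return (tag, text) pairs found inside the Reseller section."""
    match = SECTION_RE.search(html)
    if match is None:
        return []
    return [(tag, text) for text, tag in TOKEN_RE.findall(match.group(1))]


tokens = reseller_tokens(SAMPLE_HTML)
for tag, text in tokens:
    print(tag, "|", text)
```

Regex is workable here because the section is small and regular; for messier pages an HTML parser would probably be more robust.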
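I keep those closing tags in the output so we can use them to tell what's a header (</dt>) from what's a value (</dd>). After a bit of cleaning, we create a record ID within each tag type and URL, so that the first header lines up with the first value, the second with the second, and so on. Then we can transform the data into the structure you want: one row per URL, with one column per header.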
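The record ID and transform step can be sketched in Python as follows. The input rows, the placeholder URLs, and the `pivot` function name are hypothetical; the sketch also assumes every header has exactly one value in the same position, which is what numbering within each tag type and URL relies on.

```python
from collections import defaultdict

# Hypothetical tokenised rows: (url, closing tag type, text).
rows = [
    ("https://example.invalid/service-a", "dt", "Supplier type"),
    ("https://example.invalid/service-a", "dd", "Not a reseller"),
    ("https://example.invalid/service-b", "dt", "Supplier type"),
    ("https://example.invalid/service-b", "dd", "Reseller (no extras)"),
    ("https://example.invalid/service-b", "dt",
     "Organisation whose services are being resold"),
    ("https://example.invalid/service-b", "dd", "Example Vendor Ltd"),
]


def pivot(rows):
    """Pair the n-th header with the n-th value per URL, one dict per URL."""
    counters = defaultdict(int)  # next record ID for each (url, tag type)
    headers = {}                 # (url, record_id) -> header text
    values = {}                  # (url, record_id) -> value text
    for url, tag, text in rows:
        record_id = counters[(url, tag)]
        counters[(url, tag)] += 1
        target = headers if tag == "dt" else values
        target[(url, record_id)] = text

    table = defaultdict(dict)
    for (url, record_id), header in headers.items():
        # A header with no matching value becomes None rather than failing.
        table[url][header] = values.get((url, record_id))
    return dict(table)


result = pivot(rows)
for url, columns in result.items():
    print(url, columns)
```

Pages without the "Organisation whose services are being resold" subfield simply lack that key, which corresponds to an empty cell in the final table.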
Hope that helps,
Ollie