Alteryx Designer Desktop Discussions

insomned · ‎08-22-2023

Hello,

Could someone please help me? I need to extract the physical location (street, country, etc.) from a web page (the page is not in a table format). I have already used the "Text Input" and "Download" tools, but I'm not sure what to use in the RegEx tool to parse the result.

Thanks a lot!

mceleavey · ‎08-22-2023

Hi @insomned ,

Yes, we can.

Would you care to provide the website or the extracted raw HTML along with a guide as to what you would like to extract?

M.

insomned · ‎08-22-2023

Welcome to City of Adairsville, GA (adairsvillega.net)

That is the website. I need to extract the address at the bottom, together with the phone number.

Thanks!

FinnCharlton · ‎08-22-2023

Hi @insomned , if you're new to RegEx and web scraping, I think this is the easiest method to get started with. Let's say you're trying to parse this fake HTML to find the street and country:

<head><tr>STREET:"West Street"<\tr><tr>COUNTRY:"United Kingdom"<\tr><\head>

Copy it all into the Regex tool and select the 'Parse' option. Replace the parts you want to extract with (.*?):

You can see how this has automatically extracted what I need. This is the simple concept, although there are a couple more techniques you'll need to webscrape effectively. Finding a repeating unit of HTML containing a row's worth of data is crucial, as it will allow you to repeat this RegEx process automatically on each row. You may also want to practice some more RegEx, as there are many situations where this simple approach will need some amendments. For example, you might need to make the RegEx more dynamic, or learn how to escape special characters (like I've done with the backslashes above). Anyway, I hope this helps a bit, good luck!
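For anyone who wants to try the idea outside Designer, here is a minimal Python sketch of the same two techniques. It is not the RegEx tool itself, just an illustration using Python's `re` module. The variable names (`template`, `rows`) are made up for this example, and the fake HTML is copied as posted, backslashes included.

```python
import re

# The fake HTML from the post, copied verbatim (backslashes included).
html = r'<head><tr>STREET:"West Street"<\tr><tr>COUNTRY:"United Kingdom"<\tr><\head>'

# 1) Template approach: keep the literal text (escaped so that special
#    characters such as the backslash are matched literally) and put a
#    lazy capture group (.*?) wherever a value should be extracted.
template = (
    re.escape('<head><tr>STREET:"') + "(.*?)"
    + re.escape(r'"<\tr><tr>COUNTRY:"') + "(.*?)"
    + re.escape(r'"<\tr><\head>')
)
match = re.search(template, html)
street, country = match.groups() if match else (None, None)

# 2) Repeating-unit approach: one smaller pattern that describes a single
#    row, applied to every <tr>...<\tr> unit in the text.
rows = dict(re.findall(r'<tr>(.*?):"(.*?)"<\\tr>', html))

print(street, country)
print(rows)
```

The second pattern is usually the more robust of the two, because it keeps working if rows are added or reordered, whereas the full template only matches that exact layout.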

mceleavey · ‎08-22-2023

Hi @insomned ,

The important thing is to think of HTML as just structured text that uses tags to denote content.

I've built this to first isolate the section that contains the required info:

Screenshot 2023-08-22 144658.jpg

This gives the following HTML:

" target="_blank">City of Adairsville<br />116 Public Square <br />Adairsville, GA 30103 </a>
<a href="tel:770-773-3451 " class="footer-phone footer-delimiter">770-773-3451 </a>

I can then parse out the bits in between the closed tags:

>(.*?)<

Then I simply tidy it up:

Screenshot 2023-08-22 144835.jpg
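As a rough equivalent outside Designer, the parse-and-tidy steps could look like the Python sketch below. This is an assumption about what the tidy-up does (trimming whitespace and dropping empty pieces), not a copy of the attached workflow, and the names `pieces`, `fields`, `address` and `phone` are made up for the example.

```python
import re

# The isolated HTML fragment from the post, copied as-is.
html = (
    '" target="_blank">City of Adairsville<br />116 Public Square '
    '<br />Adairsville, GA 30103 </a>\n'
    '<a href="tel:770-773-3451 " class="footer-phone footer-delimiter">'
    '770-773-3451 </a>'
)

# Capture the text between each ">" and the next "<".
# "." does not match a newline by default, so the line break between
# the two anchor tags is never captured as a piece.
pieces = re.findall(r">(.*?)<", html)

# Tidy up: trim surrounding whitespace and drop anything left empty.
fields = [p.strip() for p in pieces if p.strip()]

# Assumed layout for this fragment: name, street, city/state/zip, phone.
address = ", ".join(fields[1:3])
phone = fields[3]

print(fields)
print(address)
print(phone)
```

The index positions are specific to this fragment, so they would need checking against any other page.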

Workflow attached.

I hope this helps.

M.
