Alteryx Designer Desktop Discussions


Web Scraping - Regex

insomned
8 - Asteroid

Hello, 

 

Could someone please help me? I need to extract a physical address (street, country, etc.) from a web page whose content is not in a table format. I have already used the "Text Input" and "Download" tools, but I'm not sure what RegEx to use to parse the result.

 

Thanks a lot! 

4 REPLIES
mceleavey
17 - Castor

Hi @insomned ,

 

Yes, we can.

Would you care to provide the website or the extracted raw HTML along with a guide as to what you would like to extract?

 

M.




insomned
8 - Asteroid

The website is the City of Adairsville, GA home page (adairsvillega.net). I need to extract the address at the bottom of the page, together with the phone number.

 

Thanks!

 

 

 

FinnCharlton
13 - Pulsar

Hi @insomned, if you're new to RegEx and web scraping, I think this is the easiest method to get started with. Let's say you're trying to parse this fake HTML to find the street and country:

<head><tr>STREET:"West Street"<\tr><tr>COUNTRY:"United Kingdom"<\tr><\head>

 

Copy it all into the Regex tool and select the 'Parse' option. Replace the parts you want to extract with (.*?):

 

image.png

 

You can see how this has automatically extracted what I need. That is the basic concept, although there are a couple more techniques you'll need to web scrape effectively. Finding a repeating unit of HTML that contains one row's worth of data is crucial, as it lets you repeat this RegEx process automatically on each row. You may also want to practice some more RegEx, as there are many situations where this simple approach needs some amendments. For example, you might need to make the RegEx more dynamic, or learn how to escape special characters (like I've done with the backslashes above). Anyway, I hope this helps a bit, good luck!
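For readers who want to try the same idea outside the Regex tool, here is a minimal Python sketch of it. It is only an illustration, not the workflow from the screenshot: the variable names are made up for this example, and `re.escape` stands in for the manual backslash escaping. The second half shows one possible reading of the "repeating unit" advice, assuming each `<tr>...<\tr>` pair holds one label and one value.

```python
import re

# The fake HTML from the post, kept verbatim (including its backslashes).
html = r'<head><tr>STREET:"West Street"<\tr><tr>COUNTRY:"United Kingdom"<\tr><\head>'

# 1) Template approach: keep the markup as literal text and put a
#    non-greedy capture group (.*?) where each wanted value sits.
literal_parts = ['<head><tr>STREET:"', r'"<\tr><tr>COUNTRY:"', r'"<\tr><\head>']
pattern = "(.*?)".join(re.escape(part) for part in literal_parts)

match = re.search(pattern, html)
street, country = match.groups()
print(street, "|", country)

# 2) Repeating-unit approach (an assumption about the row layout):
#    one small pattern applied to every <tr>...<\tr> block.
row_pattern = r'<tr>(.*?):"(.*?)"<\\tr>'
rows = re.findall(row_pattern, html)
for label, value in rows:
    print(label, "=", value)
```

The template version breaks as soon as the page layout changes, whereas the repeating-unit version keeps working if more rows of the same shape are added.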

 

 

mceleavey
17 - Castor

Hi @insomned ,

 

The important thing is to think of HTML as just structured text that uses tags to denote content.

 

I've built this to first isolate the section that contains the required info:

 

Screenshot 2023-08-22 144658.jpg

This gives the following HTML:

" target="_blank">City of Adairsville<br />116 Public Square <br />Adairsville, GA 30103 </a>
<a href="tel:770-773-3451 " class="footer-phone footer-delimiter">770-773-3451 </a>

 

I can then parse out the text that sits between the tags, i.e. everything between a closing ">" and the next opening "<":

>(.*?)<

Then I simply tidy it up:

Screenshot 2023-08-22 144835.jpg
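As a rough equivalent of the parse-and-tidy steps in Python (a sketch only; the attached workflow may do this differently), the same `>(.*?)<` pattern can be applied to the isolated HTML fragment shown above. The tidy-up here is an assumption: whitespace is trimmed, empty pieces are dropped, and the last piece is taken to be the phone number because that is how this particular fragment is laid out.

```python
import re

# The isolated footer fragment exactly as shown in the post.
html = '''" target="_blank">City of Adairsville<br />116 Public Square <br />Adairsville, GA 30103 </a>
<a href="tel:770-773-3451 " class="footer-phone footer-delimiter">770-773-3451 </a>'''

# Capture whatever sits between a '>' and the next '<'.
pieces = re.findall(r">(.*?)<", html)

# Tidy up: trim whitespace and discard empty captures.
cleaned = [piece.strip() for piece in pieces if piece.strip()]

# Assumption for this fragment: the last piece is the phone number,
# everything before it belongs to the address.
address = ", ".join(cleaned[:-1])
phone = cleaned[-1]

print(address)
print(phone)
```

Because `.` does not match a newline by default, the `>` at the end of the first line produces no capture, so no stray line-break piece shows up in the result.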

 

Workflow attached.

I hope this helps.

 

M.



