Alteryx Designer Desktop Discussions
SOLVED

Please improve my html parsing with regex instead of text to columns

jt_edin
8 - Asteroid

I dabble with regex so infrequently that I'm afraid I always forget what I've learned, but I'm hoping this post might help others too.

I've attached my workflow, which uses a Formula tool Replace followed by Text To Columns to parse some data from an HTML page. It follows the technique described here: https://www.thedataschool.co.uk/robbin-vernooij/web-scraping-html-tables-an-alteryx-workflow-and-r-s...

The problem is that I'm replacing HTML strings with an unusual character (~) on which to split the string into rows, but if that character were to appear in the source data then I'd be in trouble. Also, regex ought to be much more concise and robust. So if anyone can show me how to extract the information I need using regex (tokenize or parse, perhaps?), I'd be very grateful.
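
To illustrate the idea outside Alteryx, here is a minimal Python sketch of pulling fields straight out of HTML with capture groups, so no placeholder character is needed. The HTML fragment, tag names and class names are made up for illustration (the real markup of the page is not shown in this post), and the second field deliberately contains a ~ to show that it no longer matters.

```python
import re

# Hypothetical HTML fragment: these tags and class names are assumptions,
# not the actual markup of the page being scraped.
html = (
    '<li class="listing"><span class="price">$1,200,000</span>'
    '<span class="street">12 Example St ~ Unit 4</span></li>'
    '<li class="listing"><span class="price">$850,000</span>'
    '<span class="street">99 Sample Ave</span></li>'
)

# Each (.*?) is a capture group: a lazy match whose text is returned.
# The surrounding tags anchor the match, so no sentinel character is used.
pattern = re.compile(
    r'<span class="price">(.*?)</span>\s*<span class="street">(.*?)</span>',
    re.DOTALL,
)

# findall returns one tuple per listing, one element per capture group.
rows = pattern.findall(html)
for price, street in rows:
    print(price, "|", street)
```

Because the pattern matches on the tags themselves, a ~ in the data passes through untouched. Regex is still fragile against markup changes, so this is a sketch of the approach rather than a robust scraper.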

Thanks

https://www.zolo.ca/toronto-real-estate/commercial/page-1

workflow.JPG

desired outcome.JPG

2 REPLIES
LordNeilLord
15 - Aurora

Hey @jt_edin 

I had a go to see if it could be done in one regex tool...turns out it can!

Capture.PNG

jt_edin
8 - Asteroid

Awesome. But please could you explain (.*?) vs .*?

What is a marked group? And in what circumstances would you use tokenize instead of parse? Thanks
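
For anyone landing here with the same question, a rough Python sketch of the distinction (not Alteryx itself, and the sample string is invented): parentheses "mark" the part of the match you want back, while a bare .*? matches the same text lazily without returning it separately. As far as I understand the RegEx tool, Parse writes each marked group to its own column, and Tokenize returns every match of one pattern as separate columns or rows; the last two statements imitate those two behaviours.

```python
import re

text = "<b>alpha</b><b>beta</b>"  # invented sample string

# Unmarked: .*? matches lazily, but nothing is captured as a group.
unmarked = re.search(r"<b>.*?</b>", text)
whole_match = unmarked.group(0)      # the full matched text, tags included
no_groups = unmarked.groups()        # empty tuple: no marked groups

# Marked: (.*?) matches the same text and returns it as group 1.
marked = re.search(r"<b>(.*?)</b>", text)
inner = marked.group(1)

# Parse-like: one pattern describing the whole record, one field per group.
first, second = re.search(r"<b>(.*?)</b><b>(.*?)</b>", text).groups()

# Tokenize-like: one small pattern applied repeatedly, one item per match.
tokens = re.findall(r"<b>(.*?)</b>", text)

print(whole_match, no_groups, inner, first, second, tokens)
```

Roughly: reach for the parse style when each record has a fixed set of fields, and the tokenize style when the same kind of item repeats an unknown number of times.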
