Simple regex query - extract url of csv file from html of webpage

jt_edin
8 - Asteroid

I know this is trivial to a regex pro, but to me it's daunting because I still struggle with regex! I know, I know. But if it's useful to me it's going to be useful to someone else too, hence this post. Kudos up for grabs!

 

This is my url: https://edinburghcyclehire.com/open-data/historical

 

I want to extract every hyperlink to a csv file onto its own row and discard all other data. Each match starts with href=" and ends after .csv

 

I could do this with Text to Columns, but regex would be nicer. Note that each url appears twice (once in the href= attribute and once in the content= attribute). I just need each url once.

 

Please can someone show me how to do this quickly and elegantly in regex? I have attached the url and download tool to save time. Thanks

 


Dazzerman
11 - Bolide

Hi @jt_edin ,

 

I just tried this Regex quickly, which pulled out several of the CSV paths.

 

href="([^"]*?csv)

 

Will fire my laptop up with Alteryx on to provide a fuller workflow, unless someone beats me to it!  🙂
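For anyone who wants to try the expression outside Alteryx, here is a minimal Python sketch. The HTML fragment and the example.com urls are made up for illustration and are not the real page's markup. It only aims to show that the pattern picks up the href= copy of each csv link and ignores the content= copy and non-csv links.

```python
import re

# Hypothetical HTML fragment; the real page's markup may differ.
html = '''
<a href="https://example.com/data/2019/01.csv" content="https://example.com/data/2019/01.csv">January</a>
<a href="https://example.com/about.html">About</a>
<a href="https://example.com/data/2019/02.csv" content="https://example.com/data/2019/02.csv">February</a>
'''

# The expression from the post above: literal href=" followed by a
# captured run of non-quote characters ending in csv.
pattern = re.compile(r'href="([^"]*?csv)')

# findall returns only the captured group, one entry per match.
urls = pattern.findall(html)

for url in urls:
    print(url)
```

Because the pattern is anchored on href=", the duplicate in content= never matches, so each url should come out once.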

LukeG
Alteryx Alumni (Retired)

Hi @jt_edin 

 

I put @Dazzerman's solution into a workflow for you.

 

Using the RegEx tool with the expression from @Dazzerman's comment on the DownloadData field, you can parse each url into an individual row.

 

Be sure to use the Tokenize method and Split to Rows in the regex tool as well.

 

Let me know if there are any questions!

- Luke
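As a rough analogy for what Tokenize with Split to Rows does, here is a small Python sketch. The function name split_to_rows, the output key Token, and the sample /data/ paths are all made up for illustration; only the DownloadData field name and the expression come from this thread. It is not how Alteryx implements the tool, just the same idea of one output row per match.

```python
import re

# Same expression as in the thread; group 1 is the token that is kept.
CSV_HREF = re.compile(r'href="([^"]*?csv)')

def split_to_rows(records, field="DownloadData", out_field="Token"):
    """Return one output row per regex match found in record[field]."""
    rows = []
    for record in records:
        for match in CSV_HREF.finditer(record[field]):
            # Carry the other fields along and add the captured token.
            row = {k: v for k, v in record.items() if k != field}
            row[out_field] = match.group(1)
            rows.append(row)
    return rows

# One input record, as if it came out of the Download tool (sample data).
records = [{
    "URL": "https://edinburghcyclehire.com/open-data/historical",
    "DownloadData": '<a href="/data/01.csv">Jan</a><a href="/data/02.csv">Feb</a>',
}]

for row in split_to_rows(records):
    print(row)
```

A single downloaded page goes in as one record and comes out as one row per csv link.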

Dazzerman
11 - Bolide

Thanks @LukeG !

 

Thanks for posting!

 

I got my machine and solution up and running just a few minutes after your post, and the only difference with my solution was that I added a Select to just pull out the tokenised field.  Oh, and I added one more double-quote after the closing bracket. That may not be necessary, but it's belt and braces in case the 'csv' text appears somewhere earlier in the URL.  That's not the case at the moment, but you never know!

 

Hope this helps @jt_edin 

jt_edin
8 - Asteroid

Wonderful. Thank you @LukeG and @Dazzerman 

 

But what does it mean? Why do we need all the ([^"]*? symbols between our start and end strings? Thanks

jt_edin
8 - Asteroid

Would you be able to share the workflow please? Thanks ever so much

Dazzerman
11 - Bolide

No problem @jt_edin 

 

I'll break the code down for you.

 

href="

 

The first part above is the 'literal' text that anchors the start of what you are trying to find.

 

(

 

The opening bracket marks the start of your Marked Group to capture what you're after.

 

[^"]

 

This character class matches a single character, so long as it isn't a double-quote character.

 

*?

 

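The asterisk repeats the preceding character class zero or more times, so the match can run through any characters that aren't a double-quote. The question mark makes that repetition lazy: it takes as few characters as possible and stops as soon as the next part of the pattern (here, csv) is found.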

 

csv)

 

The final text matches the end of the string you're looking for, and the closing bracket then closes the Marked Group, which is used as the Token in the Regex tool.
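To see what the question mark changes in practice, here is a small Python sketch comparing the lazy and greedy forms. The sample link is invented, and its path deliberately contains 'csv' twice so the difference shows up.

```python
import re

# Made-up link whose path contains 'csv' both as a folder and as the extension.
html = '<a href="/data/csv/2019/01.csv">January</a>'

# Lazy: [^"]*? takes as few characters as possible before 'csv'.
lazy = re.findall(r'href="([^"]*?csv)', html)

# Greedy: [^"]* takes as many non-quote characters as possible,
# then backs off just far enough for 'csv' to match.
greedy = re.findall(r'href="([^"]*csv)', html)

print(lazy)    # ['/data/csv']
print(greedy)  # ['/data/csv/2019/01.csv']
```

When 'csv' only appears once, at the end of the url, both forms return the same thing. The difference only matters when 'csv' also turns up earlier in the path.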

 

If you then put another double-quote after the closing bracket, you force the regex to match only what you want, which is a URL that ends in csv, rather than stopping at some random 'csv' text that happens to appear earlier in the path or file name.  There are other ways you could achieve this as well, but this works!

🙂
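Here is a quick Python sketch of the belt-and-braces version with the extra double-quote. Both sample links are made up: one is a csv file whose path also contains 'csv', the other is an ordinary page that merely has 'csv' in its name.

```python
import re

# Made-up links: a real csv file, and an html page with 'csv' in its name.
html = (
    '<a href="/data/csv/2019/01.csv">Jan</a> '
    '<a href="/docs/csv-format.html">Format</a>'
)

# Without the trailing quote, the match stops at the first 'csv' it meets.
loose = re.findall(r'href="([^"]*?csv)', html)

# With the trailing quote, 'csv' must be the last thing before the closing quote.
strict = re.findall(r'href="([^"]*?csv)"', html)

print(loose)   # ['/data/csv', '/docs/csv']
print(strict)  # ['/data/csv/2019/01.csv']
```

The captured group still excludes the closing quote, since the quote sits outside the brackets.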

Dazzerman
11 - Bolide

Sure, no problem.
