I know this is trivial to a regex pro, but to me it's daunting because I still struggle with regex! I know, I know. But if it's useful to me it's going to be useful to someone else too, hence this post. Kudos up for grabs!
This is my url: https://edinburghcyclehire.com/open-data/historical
I want to extract any hyperlink to a csv file to new rows. I want to discard all other data. The pattern starts with href=" and then the pattern stops after .csv
I could do this with text to columns but regex would be nicer. I note that each url appears twice (once in the href= attribute and once in the content= attribute). I just need each url once.
Please can someone show me how to do this quickly and elegantly in regex? I have attached the url and download tool to save time. Thanks
Hi @jt_edin ,
I just tried this Regex quickly, which pulled out several of the CSV paths.
href="([^"]*?csv)
Will fire my laptop up with Alteryx on to provide a fuller workflow, unless someone beats me to it! 🙂
Hi @jt_edin
I put @Dazzerman's solution into a workflow for you.
Using the Regex tool with the expression from @Dazzerman's comment on the DownloadData field, you can parse each url into an individual row.
Be sure to use the Tokenize method and Split to Rows in the regex tool as well.
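For anyone who wants to check the expression outside Alteryx, here is a rough Python sketch of the same idea. It is not the attached workflow: the HTML string and the example.com URLs are made-up stand-ins for the DownloadData field, and `re.findall` is only an approximation of Tokenize with Split to Rows. The de-duplication step is an extra, added in case the same link appears more than once.

```python
import re

# Hypothetical sample standing in for the DownloadData field (not the real page).
html = (
    '<a href="https://example.com/data/2019-01.csv">January</a>'
    '<meta content="https://example.com/data/2019-01.csv">'
    '<a href="https://example.com/data/2019-01.csv">January again</a>'
    '<a href="https://example.com/data/2019-02.csv">February</a>'
    '<a href="https://example.com/about.html">About</a>'
)

# The expression from this thread: anchor on href=", capture up to csv.
pattern = re.compile(r'href="([^"]*?csv)')

# findall returns the captured group once per match, one list item per "row".
urls = pattern.findall(html)

# dict.fromkeys drops repeated urls while keeping first-seen order.
unique_urls = list(dict.fromkeys(urls))

for url in unique_urls:
    print(url)
```

Because the pattern is anchored on href=", the copy of the url inside content= is never matched, and the .html link is skipped because it has no csv before its closing quote.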
Let me know if there are any questions!
- Luke
Thanks @LukeG !
Thanks for posting!
I got my machine and solution up and running just a few minutes after your post, and the only difference with my solution was that I added a Select to just pull out the tokenised field. Oh, and I added one more double-quote after the closing bracket. It may not be necessary, but it is belt and braces in case the 'csv' text appears somewhere in the URL other than at the very end. That's not the case at the moment, but you never know!
Hope this helps @jt_edin
Wonderful. Thank you @LukeG and @Dazzerman
But what does it mean? Why do we need all the ([^"]*? symbols between our start and end strings? Thanks
Would you be able to share the workflow please? Thanks ever so much
No problem @jt_edin
I'll break the code down for you.
href="
The first part above is the 'literal' text to anchor the first part of what you are trying to find.
(
The opening bracket marks the start of your Marked Group to capture what you're after.
[^"]
This is a character class that matches any single character so long as it isn't a double-quote character.
*?
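The asterisk repeats the preceding character class zero or more times, so the match can run up to (but never past) the next double-quote. The question mark makes the repeat 'lazy': it takes as few characters as possible and stops at the first place where the text that follows (csv) can be matched.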
csv)
The final text matches the end of the string you're looking for, and then closes the Marked Group that will then be used as the Token in the Regex tool.
If you then put another double-quote after the closing bracket, you force the regex to match only when 'csv' is immediately followed by the closing quote, which is the end of the file url, rather than stopping at some random 'csv' text that happens to appear earlier in the path or file name. There are other ways you could achieve this as well, but this works!
🙂
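To see what the lazy `*?` and the extra double-quote actually change, here is a small Python sketch. The link is invented for illustration (a path that contains 'csv' before the real extension), and Python's `re` module is used here only as a convenient way to try the patterns; it is not the Alteryx Regex tool.

```python
import re

# Hypothetical link with 'csv' in the path as well as in the file extension.
html = '<a href="https://example.com/csv-exports/trips.csv">Trips</a>'

# Lazy, no trailing quote: stops at the FIRST 'csv' it can reach.
without_quote = re.findall(r'href="([^"]*?csv)', html)

# Lazy, with trailing quote: 'csv' must sit right before the closing quote.
with_quote = re.findall(r'href="([^"]*?csv)"', html)

# Greedy (no question mark): runs to the quote, then backs up to the LAST 'csv'.
greedy = re.findall(r'href="([^"]*csv)', html)

print(without_quote)  # ['https://example.com/csv']
print(with_quote)     # ['https://example.com/csv-exports/trips.csv']
print(greedy)         # ['https://example.com/csv-exports/trips.csv']
```

On this made-up input the greedy version happens to give the same result as the lazy version with the trailing quote, which is one of the "other ways" to get there; the trailing quote just states the intent more explicitly.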