Hi everyone,
I am attempting some web scraping to save time when updating a database I created, but I have never used RegEx before. Is there a way to parse out "Text" from the following, and if so, how?
<a href="https://www.Example.text.2089A.html">
<span class="icon="></span>
Text
</a>
I hope my question makes sense. I can't seem to find many useful resources for learning this in Alteryx.
Thanks!
I would suggest trying something like the Regex Coach (http://www.weitz.de/regex-coach/) to help write the RegEx.
Alteryx supports the standard Perl syntax, so you can find various resources for this online.
Looking at your specific text, something like:
<a[^>]*>\s*(<span[^>]*>\s*</span>)?\s*(.*)\s*</a>
will parse the string.
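Breaking it down:
<a[^>]*> matches the opening a tag, including any attributes.
\s* skips any whitespace, including line breaks.
(<span[^>]*>\s*</span>)? optionally matches a span tag with nothing but whitespace inside it. This is group $1.
(.*) captures the link text. This is group $2.
\s*</a> matches any trailing whitespace and the closing a tag.
The 'Text' part will be in $2.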
You can use a Regex tool in Parse mode to read this out.
Sample attached
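If you want to check the pattern outside Alteryx first, here is a minimal sketch in Python using its built-in re module. This is only an illustration and not part of the Alteryx workflow. The variable names are my own, and Python's regex engine may treat line breaks slightly differently from Alteryx, so the captured group is trimmed at the end to be safe.

```python
import re

# Sample markup copied from the question.
html = '''<a href="https://www.Example.text.2089A.html">
<span class="icon="></span>
Text
</a>'''

# Same pattern as in the post: group 1 is the optional span tag,
# group 2 is the link text.
pattern = re.compile(r'<a[^>]*>\s*(<span[^>]*>\s*</span>)?\s*(.*)\s*</a>')

match = pattern.search(html)

# Trim the capture in case the greedy (.*) picked up trailing whitespace.
link_text = match.group(2).strip() if match else None
print(link_text)
```

Group 2 here corresponds to $2 in the Regex tool.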
Hi @l_blumberger,
How about something like:
regex_replace([text],".*\W(http.*?)\W>.*","$1")
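This returns the web address from the href. The (http.*?) group matches the address non-greedily, stopping at the first non-word character that is followed by >. $1 refers to the first (and only) capture group.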
I tested with your sample data and got this result:
https://www.Example.text.2089A.html
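For anyone who wants to try the same replacement outside Alteryx, a rough Python equivalent is sketched below. It is an illustration under assumptions: re.sub stands in for regex_replace, \1 stands in for $1, and the re.DOTALL flag is added so that . also matches line breaks in the multi-line sample. Whether Alteryx needs an equivalent setting depends on how your field is stored.

```python
import re

# Sample markup copied from the question.
text = '''<a href="https://www.Example.text.2089A.html">
<span class="icon="></span>
Text
</a>'''

# Same pattern as the regex_replace call; \1 plays the role of $1.
# re.DOTALL is an assumption so the leading and trailing .* can
# consume the line breaks in the sample.
url = re.sub(r'.*\W(http.*?)\W>.*', r'\1', text, flags=re.DOTALL)
print(url)
```

Because the pattern consumes the whole string and replaces it with the captured group, only the address is left.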
If your actual data is like your example, with the a tag not inside any other tag, then you can use the XML Parse tool configured to Root and Ignore XML Errors and Continue.
If the a tag is within other tags, then you can use the option Specific Child Name with a value of "a" (no quotes) instead of Root.
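The same idea, parsing the markup as XML instead of using RegEx, can be sketched in Python with the standard xml.etree.ElementTree module. This is only an analogy to the XML Parse tool, not a description of how it works internally. The div and p wrapper tags are hypothetical, and unlike the Ignore XML Errors and Continue option, ElementTree raises an error on malformed input.

```python
import xml.etree.ElementTree as ET

# Sample markup copied from the question.
snippet = '''<a href="https://www.Example.text.2089A.html">
<span class="icon="></span>
Text
</a>'''

# Case 1: the a tag is the root element.
root = ET.fromstring(snippet)
root_text = "".join(root.itertext()).strip()
root_href = root.get("href")

# Case 2: the a tag sits inside other tags.
# The <div><p> wrapper is hypothetical, for illustration only.
wrapped = ET.fromstring("<div><p>" + snippet + "</p></div>")
child_texts = ["".join(a.itertext()).strip() for a in wrapped.iter("a")]

print(root_text, root_href, child_texts)
```

The first case mirrors the Root setting and the second mirrors Specific Child Name with a value of a.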