Alteryx Designer Desktop Discussions

l_blumberger · 11-14-2016

Hi everyone,

I am attempting to do some web scraping to save some time when it comes to updating a database I created, but I have never used RegEx before. I am wondering if/how I can parse out "Text" from the following:

<span class="icon="></span>
Text

</a>

I hope my question makes sense. I can't seem to find many useful resources for learning this in Alteryx.

Thanks!

jdunkerley79 · 11-14-2016

I would suggest trying something like the Regex Coach (http://www.weitz.de/regex-coach/) to help write the RegEx.

Alteryx supports the standard Perl syntax, so you can find various resources for this online.

Looking at your specific text, something like:

<a[^>]*>\s*(<span[^>]*>\s*</span>)?\s*(.*)\s*</a>

will parse the string.

Breaking it down:

<a[^>]*> reads the opening a tag
\s* skips any whitespace
(<span[^>]*>\s*</span>)? reads the span open and close tags if they exist (into $1)
(.*) greedily reads anything (into $2)
</a> matches the closing a tag

The 'Text' part will be in $2.

You can use a Regex tool in Parse mode to read this out.

Sample attached
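
For anyone who wants to try the pattern outside Alteryx, here is a minimal sketch using Python's re module. It is not Alteryx's own engine, so behaviour may differ slightly. The opening a tag and its href are assumptions, since the sample in the question only shows the span, the text and the closing tag. The function name extract_link_text is made up for illustration.

```python
import re

# Pattern from the post above, unchanged.
LINK_PATTERN = re.compile(r"<a[^>]*>\s*(<span[^>]*>\s*</span>)?\s*(.*)\s*</a>")

# Hypothetical input: the opening <a> tag and its href are assumed,
# because the original sample starts at the span.
SAMPLE = '<a href="https://example.com/page.html">\n<span class="icon"></span>\nText\n\n</a>'


def extract_link_text(html):
    """Return the text captured by group 2, or None if nothing matches."""
    match = LINK_PATTERN.search(html)
    if match is None:
        return None
    # The greedy (.*) can pick up trailing whitespace, so trim it.
    return match.group(2).strip()


print(extract_link_text(SAMPLE))
```

In Python the dot does not match a newline by default, so (.*) stops at the end of the line holding the text. If the engine you use lets the dot match newlines, expect $2 to carry trailing whitespace that needs trimming.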

MarqueeCrew · 11-14-2016

Hi @l_blumberger,

How about something like:

regex_replace([text],".*\W(http.*?)\W>.*","$1")

The (http.*?) group matches a web address non-greedily, stopping at the first non-word character that is followed by >. $1 is the first (and only) group.

I tested with your sample data and got this:


https://www.Example.text.2089A.html
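
The same replace can be sketched in Python for quick experimenting. This is only an approximation of the Alteryx formula: Python writes the group reference as \1 rather than $1, and the input line below is an assumption built around the URL in the result above, since the full sample is not shown in the thread. The function name extract_url is hypothetical.

```python
import re

# Pattern from the formula above, unchanged.
URL_PATTERN = re.compile(r".*\W(http.*?)\W>.*")

# Hypothetical input: only the URL comes from the thread,
# the surrounding markup is assumed.
SAMPLE = '<a href="https://www.Example.text.2089A.html"><span class="icon"></span>Text</a>'


def extract_url(text):
    """Replace the whole matched string with capture group 1 (the URL)."""
    # If the pattern does not match, the input comes back unchanged.
    return URL_PATTERN.sub(r"\1", text)


print(extract_url(SAMPLE))
```

Because the pattern starts and ends with .*, the whole line is consumed and replaced by the captured address. A line without an http address is returned as-is.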


Joe_Mako · 11-14-2016

If your actual data is like your example, with the a tag not inside any other tag, then you can use the XML Parse tool configured to Root and Ignore XML Errors and Continue.

If the a tag is within other tags, then you can use the option Specific Child Name with a value of "a" (no quotes) instead of Root.
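
As a rough illustration of the XML approach outside Alteryx, here is a sketch with Python's xml.etree.ElementTree. It is not the XML Parse tool: ElementTree needs well-formed input and has no equivalent of Ignore XML Errors and Continue. Both sample strings are assumptions, and the function name link_texts is hypothetical.

```python
import xml.etree.ElementTree as ET

# Hypothetical inputs: an a tag as the root, and the same tag nested in others.
ROOT_SAMPLE = '<a href="https://example.com/page.html"><span class="icon"></span>\nText\n\n</a>'
NESTED_SAMPLE = "<div><p>" + ROOT_SAMPLE + "</p></div>"


def link_texts(xml_text):
    """Return the trimmed text content of every a element."""
    root = ET.fromstring(xml_text)
    # iter("a") visits the root itself as well as any descendants named "a".
    return ["".join(a.itertext()).strip() for a in root.iter("a")]


print(link_texts(ROOT_SAMPLE))
print(link_texts(NESTED_SAMPLE))
```

The text sits after the closing span tag, so it is stored as the span's tail rather than as the a element's own text. That is why the sketch joins itertext() instead of reading a.text.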


Parsing with Regex - A newbie dilemma 2