Alteryx Designer Desktop Discussions

SOLVED

Parsing with Regex - A newbie dilemma 2

l_blumberger
7 - Meteor

Hi everyone, 


I am attempting to do some web scraping to save some time when it comes to updating a database I created, but I have never used RegEx before. I am wondering if/how I can parse out "Text" from the following:

 

<a href="https://www.Example.text.2089A.html">

<span class="icon="></span>
Text

</a>

 

 

I hope my question makes sense. I can't seem to find many useful resources for learning RegEx in Alteryx.


Thanks!

3 REPLIES
jdunkerley79
ACE Emeritus

I would suggest trying something like the Regex Coach (http://www.weitz.de/regex-coach/) to help write the RegEx.

 

Alteryx supports standard Perl regular-expression syntax, so you can find various resources for this online.

 

Looking at your specific text, something like:

<a[^>]*>\s*(<span[^>]*>\s*</span>)?\s*(.*)\s*</a>

will parse the string.

 

Breaking it down:

  • <a[^>]*> reads the opening a tag
  • \s* skips any white space
  • (<span[^>]*>\s*</span>)? reads the span open and close tags, if they exist (into $1)
  • (.*) greedily reads anything (into $2)
  • </a> matches the closing a tag

 

The 'Text' part will be in $2.
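
For anyone who wants to try the pattern outside Alteryx, here is a minimal sketch in Python, whose re module also follows Perl-style syntax. The function name extract_link_text is just an illustrative choice, and Python's handling of "." across line breaks may differ from Alteryx's, so treat it as a way to experiment rather than a guarantee of identical results.

```python
import re

# Sample HTML from the question, copied verbatim.
html = """<a href="https://www.Example.text.2089A.html">

<span class="icon="></span>
Text

</a>"""

# Same pattern as above: group 1 is the optional span, group 2 the link text.
pattern = re.compile(r"<a[^>]*>\s*(<span[^>]*>\s*</span>)?\s*(.*)\s*</a>")


def extract_link_text(source):
    """Return the text captured in group 2, or None when nothing matches."""
    match = pattern.search(source)
    if match is None:
        return None
    # strip() guards against trailing white space the greedy (.*) may pick up.
    return match.group(2).strip()


print(extract_link_text(html))
```

Group 2 here corresponds to $2 in the Alteryx RegEx tool.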

 

You can use a Regex tool in Parse mode to read this out.

 

Sample attached 

MarqueeCrew
20 - Arcturus

Hi @l_blumberger,

 

How about something like:

 

regex_replace([text],".*\W(http.*?)\W>.*","$1")

The (http.*?) group matches the web address non-greedily, stopping at the first non-word character that is followed by ">". $1 refers to the first (and only) capture group.
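
The same replace can be sketched in Python for experimenting outside Alteryx. This is only an approximation: the function name extract_url is made up for the example, Python writes the back-reference as \1 rather than $1, and the re.DOTALL flag is an assumption added here so that "." can run across the line breaks in the sample.

```python
import re

# Sample HTML from the question, copied verbatim.
html = """<a href="https://www.Example.text.2089A.html">

<span class="icon="></span>
Text

</a>"""


def extract_url(source):
    """Replace the whole input with the captured web address (group 1)."""
    # re.DOTALL lets "." match newlines, so the leading and trailing ".*"
    # swallow everything around the URL, including the later lines.
    return re.sub(r".*\W(http.*?)\W>.*", r"\1", source, flags=re.DOTALL)


print(extract_url(html))
```

Note that when the pattern does not match, re.sub returns the input unchanged, so it is worth checking the result before loading it into a database.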

 

I tested with your sample data and got this:

 


https://www.Example.text.2089A.html

Joe_Mako
12 - Quasar

If your actual data is like your example, with the a tag not inside any other tag, then you can use the XML Parse tool with the element to parse set to Root and the "Ignore XML Errors and Continue" option enabled.

 

If the a tag is within other tags, then you can use the option Specific Child Name with a value of "a" (no quotes) instead of Root.
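
To illustrate the parser-based idea outside Alteryx, here is a rough Python analogue using the standard library's xml.etree.ElementTree. It is not the XML Parse tool itself: the function name link_texts and the wrapper tags in the usage note are invented for the example, and unlike the "Ignore XML Errors and Continue" option, ElementTree raises an error on input that is not well formed.

```python
import xml.etree.ElementTree as ET

# Sample from the question, copied verbatim; it happens to be well-formed XML.
html = """<a href="https://www.Example.text.2089A.html">

<span class="icon="></span>
Text

</a>"""


def link_texts(source):
    """Return (text, href) pairs for every a tag in the parsed source."""
    root = ET.fromstring(source)
    results = []
    # iter("a") visits the root itself when it is an a tag, plus any
    # a tags nested deeper, which covers both cases described above.
    for a in root.iter("a"):
        # itertext() gathers text around child tags such as the span.
        text = "".join(a.itertext()).strip()
        results.append((text, a.get("href")))
    return results


print(link_texts(html))
```

Wrapping the sample in other tags, for example "<div><p>" + html + "</p></div>", gives the same pairs, much like switching from Root to Specific Child Name.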

 

