SOLVED

Parse HTML using RegEx

hellyars
13 - Pulsar

I am trying to extract fields and data from an HTML field.  The HTML field is itself a table.

 

Here is the target field. 

</TR><BR><TABLE BORDER=2 WIDTH="100%"><TR><TD><TABLE BORDER=0 WIDTH="100%"><TR><TD WIDTH=300><B>Site ID:</B> 1</TD><TD WIDTH=300><B>Antenna ID:</B> 1         <TR><TD WIDTH=300><B>Manufacturer:</B> SCIENTIFIC-ATLANTA</TD><TD WIDTH=300><B>Diameter (meters):</B> 3.6<TR><TD WIDTH=300><B>Model:</B> MODEL 8136</TD><TD WIDTH=300><B>Diameter Minor (m):</B> 0<TR><TD WIDTH=300><B>Building Height AGL (m):</B> 0</TD><TD WIDTH=300><B>Diameter Major (m):</B> 0<TR><TD WIDTH=300><B>Max Antenna Height AGL (m):</B> 5.2</TD><TD WIDTH=300><B>Quantity:</B> 2<TR><TD WIDTH=300><B>Max Antenna Height AMSL (m):</B> 14.3</TD><TD WIDTH=300><B>Total Power (Watts):</B> 0<TR><TD WIDTH=300><B>Max Antenna Height Above Rooftop (m):</B> 0</TD><TD WIDTH=300><B>Total EIRP (dBW):</B> 0</TABLE></TD></TR></TABLE>

 

Here is my expression.

 

^.*?(\bSite\b.*?\<.*?)\<.*?(\bAntenna ID:\<.*?)\<.*?(\bManufacturer:\<.*?)\<.*?(\bDiameter\s.*?\<.*?)\<.*?(\bModel:\<.*?)\<.*?(\bDiameter\b.*?\<.*?)\<.*?(\bBuilding\b.*?\<.*?)\<.*?(\bDiameter\b.*?\<.*?)\<.*?(\bMax\b.*?\<.*?)\<.*?(\bQuantity\b.*?\<.*?)\<.*?(\bMax\b.*?\<.*?)\<.*?(\bTotal\b.*?\<.*?)\<.*$

It is pretty ugly.  Each output includes the field and the data plus some unnecessary HTML tags.  I figure a few additional parse steps are necessary to clean everything up.  If there is an easier way I am open to it.  

 

The problem is that the expression works with regex101.com or a desktop app like Patterns or RegExRX, but it does not work with the Alteryx RegEx tool (set to parse).  The output fields are null.  What am I doing wrong?
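One likely cause, offered as an assumption rather than a confirmed diagnosis: the Alteryx RegEx tool is documented as using the Boost regex library's Perl syntax, where `\<` is a start-of-word assertion, not a literal `<`. PCRE-style engines such as regex101's default treat `\<` as a plain `<`, which would explain why the expression matches there and returns nulls in Alteryx. Since `<` needs no escaping, writing it bare should behave the same in both. The Python sketch below only shows that the unescaped form matches the sample field and yields the 12 groups; Python's `re` also reads `\<` as a literal, so it cannot reproduce the Boost behaviour itself. The variable names are illustrative.

```python
import re

# The target field from the post, split across lines only for readability.
html = (
    '</TR><BR><TABLE BORDER=2 WIDTH="100%"><TR><TD><TABLE BORDER=0 WIDTH="100%">'
    '<TR><TD WIDTH=300><B>Site ID:</B> 1</TD><TD WIDTH=300><B>Antenna ID:</B> 1         '
    '<TR><TD WIDTH=300><B>Manufacturer:</B> SCIENTIFIC-ATLANTA</TD><TD WIDTH=300><B>Diameter (meters):</B> 3.6'
    '<TR><TD WIDTH=300><B>Model:</B> MODEL 8136</TD><TD WIDTH=300><B>Diameter Minor (m):</B> 0'
    '<TR><TD WIDTH=300><B>Building Height AGL (m):</B> 0</TD><TD WIDTH=300><B>Diameter Major (m):</B> 0'
    '<TR><TD WIDTH=300><B>Max Antenna Height AGL (m):</B> 5.2</TD><TD WIDTH=300><B>Quantity:</B> 2'
    '<TR><TD WIDTH=300><B>Max Antenna Height AMSL (m):</B> 14.3</TD><TD WIDTH=300><B>Total Power (Watts):</B> 0'
    '<TR><TD WIDTH=300><B>Max Antenna Height Above Rooftop (m):</B> 0</TD><TD WIDTH=300><B>Total EIRP (dBW):</B> 0'
    '</TABLE></TD></TR></TABLE>'
)

# The expression exactly as posted, including every "\<".
posted = (
    r"^.*?(\bSite\b.*?\<.*?)\<.*?(\bAntenna ID:\<.*?)\<.*?(\bManufacturer:\<.*?)"
    r"\<.*?(\bDiameter\s.*?\<.*?)\<.*?(\bModel:\<.*?)\<.*?(\bDiameter\b.*?\<.*?)"
    r"\<.*?(\bBuilding\b.*?\<.*?)\<.*?(\bDiameter\b.*?\<.*?)\<.*?(\bMax\b.*?\<.*?)"
    r"\<.*?(\bQuantity\b.*?\<.*?)\<.*?(\bMax\b.*?\<.*?)\<.*?(\bTotal\b.*?\<.*?)\<.*$"
)

# Same expression with each "\<" written as a bare "<" (the assumed fix for Alteryx).
portable = posted.replace(r"\<", "<")

match = re.match(portable, html)
fields = match.groups() if match else ()

for field in fields:
    print(field)
```

As the post says, each captured group still carries the label, the value, and a stray `</B>` tag, so a cleanup step is still needed afterwards.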

 

[Attached screenshots: Screen Shot 2019-10-31 at 7.25.58 PM.png, Screen Shot 2019-10-31 at 7.21.27 PM.png, Screen Shot 2019-10-31 at 7.24.07 PM.png]

PhilipMannering
16 - Nebula

No idea why your regex doesn't work in Alteryx (to be honest I didn't really check).

 

But you might find my solution helpful. It uses a couple of RegEx tools set to Tokenize, and I've tried to keep the expressions as short as possible.
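The two-step tokenize idea can be sketched outside Alteryx in Python, with `re.findall` standing in for a RegEx tool set to Tokenize. The first expression, `<TD.*?(?=<TD)`, is the one quoted in the follow-up post; the second expression and all variable names are assumptions for illustration, since the workflow image is not reproduced here. The sample is an abbreviated copy of the target field.

```python
import re

# Abbreviated copy of the target field (first two rows and the last row).
html = (
    '<TABLE BORDER=2 WIDTH="100%"><TR><TD><TABLE BORDER=0 WIDTH="100%">'
    '<TR><TD WIDTH=300><B>Site ID:</B> 1</TD>'
    '<TD WIDTH=300><B>Antenna ID:</B> 1         '
    '<TR><TD WIDTH=300><B>Manufacturer:</B> SCIENTIFIC-ATLANTA</TD>'
    '<TD WIDTH=300><B>Diameter (meters):</B> 3.6'
    '<TR><TD WIDTH=300><B>Max Antenna Height Above Rooftop (m):</B> 0</TD>'
    '<TD WIDTH=300><B>Total EIRP (dBW):</B> 0</TABLE></TD></TR></TABLE>'
)

# Step 1: one token per cell, each running up to (not including) the next <TD.
cells = re.findall(r"<TD.*?(?=<TD)", html)

# Step 2 (hypothetical expression): pull the bold label and the text after it.
pair_re = re.compile(r"<B>(.*?):</B>\s*([^<]*)")

pairs = []
for cell in cells:
    m = pair_re.search(cell)
    if m:  # the outer wrapper cell has no <B> label and is skipped
        pairs.append((m.group(1), m.group(2).strip()))

for label, value in pairs:
    print(f"{label} = {value}")
```

Because the lookahead needs a following `<TD`, the final cell of the field is not tokenized by step 1.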

 

Regex HTML.jpg

hellyars
13 - Pulsar

@PhilipMannering  This is nice and clean.  I will start exploring this technique.  I have a lot of HTML source documents.  

 

How do I get the first expression, <TD.*?(?=<TD), to match the last <TD element in the example? It captures everything except the Total EIRP (dBW): cell.

 

PhilipMannering
16 - Nebula

Ah yes. You're absolutely right.

 

See attached.
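The attached workflow is not reproduced in this thread text. One way to make the first expression also pick up the final cell, given here as an assumption and not necessarily what the attachment does, is to let the lookahead accept either the next `<TD` or the end of the field: `<TD.*?(?=<TD|$)`. A Python sketch on an abbreviated copy of the target field, with a hypothetical second expression for the label and value:

```python
import re

# Abbreviated copy of the target field (first row and the last row).
html = (
    '<TABLE BORDER=2 WIDTH="100%"><TR><TD><TABLE BORDER=0 WIDTH="100%">'
    '<TR><TD WIDTH=300><B>Site ID:</B> 1</TD>'
    '<TD WIDTH=300><B>Antenna ID:</B> 1         '
    '<TR><TD WIDTH=300><B>Max Antenna Height Above Rooftop (m):</B> 0</TD>'
    '<TD WIDTH=300><B>Total EIRP (dBW):</B> 0</TABLE></TD></TR></TABLE>'
)

# Original tokenizer: stops before the next <TD, so the last cell never matches.
original_cells = re.findall(r"<TD.*?(?=<TD)", html)

# Assumed fix: the lookahead also succeeds at the end of the field.
fixed_cells = re.findall(r"<TD.*?(?=<TD|$)", html)

# Hypothetical second step: bold label, then the text up to the next tag.
pair_re = re.compile(r"<B>(.*?):</B>\s*([^<]*)")
pairs = [
    (m.group(1), m.group(2).strip())
    for m in (pair_re.search(cell) for cell in fixed_cells)
    if m
]

print(len(original_cells), len(fixed_cells))
print(pairs[-1])
```

The sketch assumes the field contains no line breaks; if it does, the `$` alternative may need adjusting for the engine's end-of-line rules.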
