
Alteryx Designer Desktop Discussions

SOLVED

Parse HTML using RegEx

hellyars
13 - Pulsar

I am trying to extract field names and their values from an HTML field. The HTML field is itself a table.

 

Here is the target field. 

</TR><BR><TABLE BORDER=2 WIDTH="100%"><TR><TD><TABLE BORDER=0 WIDTH="100%"><TR><TD WIDTH=300><B>Site ID:</B> 1</TD><TD WIDTH=300><B>Antenna ID:</B> 1         <TR><TD WIDTH=300><B>Manufacturer:</B> SCIENTIFIC-ATLANTA</TD><TD WIDTH=300><B>Diameter (meters):</B> 3.6<TR><TD WIDTH=300><B>Model:</B> MODEL 8136</TD><TD WIDTH=300><B>Diameter Minor (m):</B> 0<TR><TD WIDTH=300><B>Building Height AGL (m):</B> 0</TD><TD WIDTH=300><B>Diameter Major (m):</B> 0<TR><TD WIDTH=300><B>Max Antenna Height AGL (m):</B> 5.2</TD><TD WIDTH=300><B>Quantity:</B> 2<TR><TD WIDTH=300><B>Max Antenna Height AMSL (m):</B> 14.3</TD><TD WIDTH=300><B>Total Power (Watts):</B> 0<TR><TD WIDTH=300><B>Max Antenna Height Above Rooftop (m):</B> 0</TD><TD WIDTH=300><B>Total EIRP (dBW):</B> 0</TABLE></TD></TR></TABLE>

 

Here is my expression.

 

^.*?(\bSite\b.*?\<.*?)\<.*?(\bAntenna ID:\<.*?)\<.*?(\bManufacturer:\<.*?)\<.*?(\bDiameter\s.*?\<.*?)\<.*?(\bModel:\<.*?)\<.*?(\bDiameter\b.*?\<.*?)\<.*?(\bBuilding\b.*?\<.*?)\<.*?(\bDiameter\b.*?\<.*?)\<.*?(\bMax\b.*?\<.*?)\<.*?(\bQuantity\b.*?\<.*?)\<.*?(\bMax\b.*?\<.*?)\<.*?(\bTotal\b.*?\<.*?)\<.*$

It is pretty ugly.  Each output includes the field and the data plus some unnecessary HTML tags.  I figure a few additional parse steps are necessary to clean everything up.  If there is an easier way I am open to it.  
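On the "easier way" question, here is a minimal sketch in Python rather than Alteryx, assuming every label sits inside `<B>...</B>` and ends with a colon, as it does in the sample above. The shortened `html` string and the name `pairs` are illustrative only. Instead of one long expression with twelve groups, a short pattern is applied repeatedly and returns one label/value pair per match.

```python
import re

# Shortened copy of the HTML field from the post (first four cells only).
html = ('<TR><TD WIDTH=300><B>Site ID:</B> 1</TD>'
        '<TD WIDTH=300><B>Antenna ID:</B> 1         '
        '<TR><TD WIDTH=300><B>Manufacturer:</B> SCIENTIFIC-ATLANTA</TD>'
        '<TD WIDTH=300><B>Diameter (meters):</B> 3.6<TR>')

# Group 1: label text up to the colon. Group 2: everything up to the next tag.
pair_pattern = re.compile(r'<B>([^<:]+):</B>([^<]*)')

# Build a label -> value mapping, trimming the padding around each value.
pairs = {label: value.strip() for label, value in pair_pattern.findall(html)}

for label, value in pairs.items():
    print(f'{label} = {value}')
```

Because the pattern does not name any specific field, the same expression should keep working if fields are added or reordered, provided the `<B>label:</B> value` layout holds.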

 

The problem is that the expression works with regex101.com or a desktop app like Patterns or RegExRX, but it does not work with the Alteryx RegEx tool (set to parse).  The output fields are null.  What am I doing wrong?
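One possible cause, offered as an assumption and not confirmed anywhere in this thread: the Alteryx RegEx tool follows the Boost.Regex Perl syntax, where `\<` is a start-of-word assertion, whereas PCRE-style engines such as the one behind regex101.com read `\<` as a literal `<`. If that is what is happening, writing a plain `<` should behave the same in both. The sketch below uses Python's `re` module (not Alteryx) on a shortened sample, with only the first three groups of the expression, to show that the unescaped form still captures the intended text.

```python
import re

# Shortened copy of the HTML field from the post (first three cells only).
html = ('<TR><TD WIDTH=300><B>Site ID:</B> 1</TD>'
        '<TD WIDTH=300><B>Antenna ID:</B> 1         '
        '<TR><TD WIDTH=300><B>Manufacturer:</B> SCIENTIFIC-ATLANTA</TD>')

# First three groups of the original expression, with "\<" written as plain "<".
pattern = (r'^.*?(\bSite\b.*?<.*?)<.*?(\bAntenna ID:<.*?)<.*?'
           r'(\bManufacturer:<.*?)<.*$')

match = re.search(pattern, html)
groups = match.groups() if match else ()

for number, text in enumerate(groups, start=1):
    print(number, repr(text))
```

Whether this change alone fixes the null outputs in Alteryx has not been tested here.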

 

[Attached screenshots: Screen Shot 2019-10-31 at 7.25.58 PM.png, Screen Shot 2019-10-31 at 7.21.27 PM.png, Screen Shot 2019-10-31 at 7.24.07 PM.png]

3 REPLIES
PhilipMannering
16 - Nebula

No idea why your regex doesn't work in Alteryx (to be honest I didn't really check).

 

But you might find my solution helpful. It's a couple of RegEx tools set to Tokenize, and I've tried to keep the number of characters down to a minimum.

 

Regex HTML.jpg
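Since the workflow is only available as an image, here is a rough Python sketch of the two-stage tokenize idea. The stage 1 expression `<TD.*?(?=<TD)` is the one quoted later in this thread. The stage 2 expression and the shortened `html` sample are assumptions made for illustration, not the expressions from the attached workflow.

```python
import re

# Shortened copy of the inner table from the post (four cells).
html = ('<TABLE BORDER=0 WIDTH="100%"><TR><TD WIDTH=300><B>Site ID:</B> 1</TD>'
        '<TD WIDTH=300><B>Antenna ID:</B> 1         <TR>'
        '<TD WIDTH=300><B>Manufacturer:</B> SCIENTIFIC-ATLANTA</TD>'
        '<TD WIDTH=300><B>Diameter (meters):</B> 3.6</TABLE>')

# Stage 1: one token per cell, each running up to (not including) the next <TD.
cells = re.findall(r'<TD.*?(?=<TD)', html)

# Stage 2 (hypothetical expression): split each cell into a label and a value.
rows = []
for cell in cells:
    m = re.search(r'<B>(.*?):</B>\s*([^<]*)', cell)
    if m:
        rows.append((m.group(1), m.group(2).strip()))

# Note: the final cell has no following <TD, so stage 1 does not return it.
print(rows)
```

Each stage stays short because neither expression has to know the field names in advance.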

hellyars
13 - Pulsar

@PhilipMannering  This is nice and clean.  I will start exploring this technique.  I have a lot of HTML source documents.  

 

How do I get the first expression, <TD.*?(?=<TD), to match the last <TD element in the example? It captures everything but Total EIRP (dBW).

 

PhilipMannering
16 - Nebula

Ah yes. You're absolutely right.

 

See attached.
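The attachment is not reproduced here, so the following is only one possible fix and not necessarily the one that was attached. The last cell is skipped because the lookahead `(?=<TD)` requires another `<TD` to follow. Letting the lookahead also accept the end of the string, `(?=<TD|$)`, is a common way around that. A minimal Python sketch on the tail end of the sample HTML, with illustrative variable names:

```python
import re

# Last three cells of the HTML field from the post.
html = ('<TD WIDTH=300><B>Total Power (Watts):</B> 0<TR>'
        '<TD WIDTH=300><B>Max Antenna Height Above Rooftop (m):</B> 0</TD>'
        '<TD WIDTH=300><B>Total EIRP (dBW):</B> 0</TABLE></TD></TR></TABLE>')

# Original expression: every token must be followed by another <TD.
original = re.findall(r'<TD.*?(?=<TD)', html)

# Adjusted expression: a token may also run to the end of the string.
adjusted = re.findall(r'<TD.*?(?=<TD|$)', html)

print(len(original), len(adjusted))
print(adjusted[-1])
```

The last token still carries the closing `</TABLE></TD></TR></TABLE>` tags, so a later clean-up step is needed either way. This was run against Python's `re` module only and should be checked in the Alteryx RegEx tool before relying on it.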
