Alteryx Designer Desktop Discussions

hellyars · ‎10-31-2019

I am trying to extract fields and data from an HTML field. The HTML field is itself a table.

Here is the target field.

</TR><BR><TABLE BORDER=2 WIDTH="100%"><TR><TD><TABLE BORDER=0 WIDTH="100%"><TR><TD WIDTH=300><B>Site ID:</B> 1</TD><TD WIDTH=300><B>Antenna ID:</B> 1         <TR><TD WIDTH=300><B>Manufacturer:</B> SCIENTIFIC-ATLANTA</TD><TD WIDTH=300><B>Diameter (meters):</B> 3.6<TR><TD WIDTH=300><B>Model:</B> MODEL 8136</TD><TD WIDTH=300><B>Diameter Minor (m):</B> 0<TR><TD WIDTH=300><B>Building Height AGL (m):</B> 0</TD><TD WIDTH=300><B>Diameter Major (m):</B> 0<TR><TD WIDTH=300><B>Max Antenna Height AGL (m):</B> 5.2</TD><TD WIDTH=300><B>Quantity:</B> 2<TR><TD WIDTH=300><B>Max Antenna Height AMSL (m):</B> 14.3</TD><TD WIDTH=300><B>Total Power (Watts):</B> 0<TR><TD WIDTH=300><B>Max Antenna Height Above Rooftop (m):</B> 0</TD><TD WIDTH=300><B>Total EIRP (dBW):</B> 0</TABLE></TD></TR></TABLE>

Here is my expression.

^.*?(\bSite\b.*?\<.*?)\<.*?(\bAntenna ID:\<.*?)\<.*?(\bManufacturer:\<.*?)\<.*?(\bDiameter\s.*?\<.*?)\<.*?(\bModel:\<.*?)\<.*?(\bDiameter\b.*?\<.*?)\<.*?(\bBuilding\b.*?\<.*?)\<.*?(\bDiameter\b.*?\<.*?)\<.*?(\bMax\b.*?\<.*?)\<.*?(\bQuantity\b.*?\<.*?)\<.*?(\bMax\b.*?\<.*?)\<.*?(\bTotal\b.*?\<.*?)\<.*$

It is pretty ugly. Each output includes the field and the data plus some unnecessary HTML tags. I figure a few additional parse steps are necessary to clean everything up. If there is an easier way I am open to it.

The problem is that the expression works with regex101.com or a desktop app like Patterns or RegExRX, but it does not work with the Alteryx RegEx tool (set to parse). The output fields are null. What am I doing wrong?
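One possible explanation, which is not confirmed anywhere in this thread: the Alteryx RegEx tool uses Boost's Perl-style syntax, where `\<` is a start-of-word anchor rather than a literal `<`. Most other engines, including the ones behind regex101.com, read `\<` as a plain `<`, which would explain why the expression matches there and returns nulls in Alteryx. The Python sketch below cannot reproduce the Boost behaviour. It only shows, on a shortened copy of the sample and the first three groups of the expression, that the pattern still matches with `<` left unescaped, which is the form that should be safe in both engines.

```python
import re

# Shortened sample of the HTML field from the post (first three cells only).
html = (
    '<TR><TD WIDTH=300><B>Site ID:</B> 1</TD>'
    '<TD WIDTH=300><B>Antenna ID:</B> 1         '
    '<TR><TD WIDTH=300><B>Manufacturer:</B> SCIENTIFIC-ATLANTA</TD>'
)

# First three groups of the original expression, with "<" left unescaped
# so that no engine can interpret it as a word-boundary anchor.
pattern = re.compile(
    r'^.*?(\bSite\b.*?<.*?)<.*?(\bAntenna ID:<.*?)<.*?(\bManufacturer:<.*?)<.*$'
)

m = pattern.match(html)
groups = m.groups() if m else ()
for g in groups:
    print(repr(g))
```

If this assumption holds, replacing every `\<` with `<` in the full expression would be the first thing to try in the RegEx tool.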

PhilipMannering · ‎10-31-2019

No idea why your regex doesn't work in Alteryx (to be honest I didn't really check).

But you might find my solution helpful. It's a couple of RegEx tools set to Tokenize, but I've tried to keep the number of characters down to a minimum.
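The attached workflow is not reproduced in the thread, so the following is only a rough Python analogue of the tokenize idea, not the actual Alteryx configuration. It assumes every label sits in a `<B>…:</B>` tag and that the value is the text running up to the next tag. The pattern and the `record` name are illustrative choices.

```python
import re

# Shortened sample of the HTML field from the original post.
html = (
    '<TR><TD WIDTH=300><B>Site ID:</B> 1</TD>'
    '<TD WIDTH=300><B>Antenna ID:</B> 1         '
    '<TR><TD WIDTH=300><B>Manufacturer:</B> SCIENTIFIC-ATLANTA</TD>'
    '<TD WIDTH=300><B>Diameter (meters):</B> 3.6'
    '<TR><TD WIDTH=300><B>Total EIRP (dBW):</B> 0</TABLE></TD></TR></TABLE>'
)

# One small pattern applied repeatedly (the "tokenize" idea):
#   group 1 = label text inside <B>...</B>, without the trailing colon
#   group 2 = everything after </B> up to the next "<"
pairs = re.findall(r'<B>([^<:]+):</B>\s*([^<]*)', html)

# Trim the padding that follows some values in the source HTML.
record = {label.strip(): value.strip() for label, value in pairs}

for label, value in record.items():
    print(f'{label} = {value}')
```

Matching one label/value pair at a time avoids having to list every field in a single long expression, so it keeps working if fields are added or reordered.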

hellyars · ‎11-01-2019

@PhilipMannering This is nice and clean. I will start exploring this technique. I have a lot of HTML source documents.

How do I get the first expression, <TD.*?(?=<TD), to match the last <TD element in the example? It captures everything but the Total EIRP (dBW) field.

PhilipMannering · ‎11-02-2019

Ah yes. You're absolutely right.

See attached.
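The attachment is not included here, so the fix it contains is unknown. As a sketch of one possible approach: `<TD.*?(?=<TD)` skips the last cell because the lookahead requires another `<TD` to follow, and nothing follows the final one. Letting the lookahead also accept the end of the string, `(?=<TD|$)`, picks it up. This is shown below in Python on a shortened sample. Whether the Alteryx RegEx tool handles the field identically (for example, if it contains line breaks) is an assumption.

```python
import re

# Last three cells of the sample field, shortened for illustration.
html = (
    '<TR><TD WIDTH=300><B>Quantity:</B> 2'
    '<TR><TD WIDTH=300><B>Total Power (Watts):</B> 0'
    '<TR><TD WIDTH=300><B>Total EIRP (dBW):</B> 0</TABLE>'
)

# Original tokenize expression: each token must be followed by another <TD,
# so the final cell never satisfies the lookahead.
original = re.findall(r'<TD.*?(?=<TD)', html)

# Variant: the lookahead also succeeds at the end of the string.
fixed = re.findall(r'<TD.*?(?=<TD|$)', html)

print(len(original), len(fixed))
print(fixed[-1])
```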
