I am trying to extract fields and data from an HTML field. The HTML field is itself a table.
Here is the target field.
</TR><BR><TABLE BORDER=2 WIDTH="100%"><TR><TD><TABLE BORDER=0 WIDTH="100%"><TR><TD WIDTH=300><B>Site ID:</B> 1</TD><TD WIDTH=300><B>Antenna ID:</B> 1 <TR><TD WIDTH=300><B>Manufacturer:</B> SCIENTIFIC-ATLANTA</TD><TD WIDTH=300><B>Diameter (meters):</B> 3.6<TR><TD WIDTH=300><B>Model:</B> MODEL 8136</TD><TD WIDTH=300><B>Diameter Minor (m):</B> 0<TR><TD WIDTH=300><B>Building Height AGL (m):</B> 0</TD><TD WIDTH=300><B>Diameter Major (m):</B> 0<TR><TD WIDTH=300><B>Max Antenna Height AGL (m):</B> 5.2</TD><TD WIDTH=300><B>Quantity:</B> 2<TR><TD WIDTH=300><B>Max Antenna Height AMSL (m):</B> 14.3</TD><TD WIDTH=300><B>Total Power (Watts):</B> 0<TR><TD WIDTH=300><B>Max Antenna Height Above Rooftop (m):</B> 0</TD><TD WIDTH=300><B>Total EIRP (dBW):</B> 0</TABLE></TD></TR></TABLE>
Here is my expression.
^.*?(\bSite\b.*?\<.*?)\<.*?(\bAntenna ID:\<.*?)\<.*?(\bManufacturer:\<.*?)\<.*?(\bDiameter\s.*?\<.*?)\<.*?(\bModel:\<.*?)\<.*?(\bDiameter\b.*?\<.*?)\<.*?(\bBuilding\b.*?\<.*?)\<.*?(\bDiameter\b.*?\<.*?)\<.*?(\bMax\b.*?\<.*?)\<.*?(\bQuantity\b.*?\<.*?)\<.*?(\bMax\b.*?\<.*?)\<.*?(\bTotal\b.*?\<.*?)\<.*$
It is pretty ugly. Each output includes the field and the data plus some unnecessary HTML tags. I figure a few additional parse steps are necessary to clean everything up. If there is an easier way I am open to it.
The problem is that the expression works with regex101.com or a desktop app like Patterns or RegExRX, but it does not work with the Alteryx RegEx tool (set to parse). The output fields are null. What am I doing wrong?
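For comparison, here is a minimal sketch of the "easier way" idea, written in Python rather than Alteryx: instead of one long expression with twelve capture groups, a short pattern is applied repeatedly and yields one label/value pair per match. The sample string is a shortened copy of the HTML above, and the names `pair` and `fields` are illustrative only. Whether the Alteryx RegEx tool behaves identically is an assumption, not something verified here.

```python
import re

# Shortened copy of the sample HTML from the post (first three cells only).
html = (
    '<TR><TD WIDTH=300><B>Site ID:</B> 1</TD>'
    '<TD WIDTH=300><B>Antenna ID:</B> 1 '
    '<TR><TD WIDTH=300><B>Manufacturer:</B> SCIENTIFIC-ATLANTA</TD>'
)

# Group 1: the label inside <B>...</B>, without the trailing colon.
# Group 2: the value, i.e. everything up to the next tag.
pair = re.compile(r'<B>([^<]+?):</B>\s*([^<]*)')

# findall returns one (label, value) tuple per cell.
fields = {label: value.strip() for label, value in pair.findall(html)}
print(fields)
```

Because each match stands alone, a missing or reordered cell does not break the remaining matches, which is the main weakness of the single long expression.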
@PhilipMannering This is nice and clean. I will start exploring this technique. I have a lot of HTML source documents.
How do I get the first expression, <TD.*?(?=<TD), to match the last <TD element in the example? It captures everything except the Total EIRP (dBW): field.
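One likely explanation is that the lookahead (?=<TD) requires another <TD to follow, and nothing of that kind follows the last cell. A possible fix is to let the lookahead also accept the closing </TABLE tag or the end of the string. The sketch below checks this in Python on the tail of the sample HTML. It assumes the Alteryx RegEx tool handles alternation inside a lookahead the same way Python's `re` module does, and the variable names are illustrative.

```python
import re

# Tail of the sample HTML from the post: the last two cells.
html = (
    '<TR><TD WIDTH=300><B>Max Antenna Height Above Rooftop (m):</B> 0</TD>'
    '<TD WIDTH=300><B>Total EIRP (dBW):</B> 0</TABLE></TD></TR></TABLE>'
)

# Original expression: each match must be followed by another <TD,
# so the final cell is never returned.
original = re.findall(r'<TD.*?(?=<TD)', html)

# Widened lookahead: stop at the next <TD, at </TABLE, or at end of input.
widened = re.findall(r'<TD.*?(?=<TD|</TABLE|$)', html)

print(len(original), len(widened))
print(widened[-1])
```

With the widened lookahead the last token ends just before </TABLE, so the Total EIRP (dBW): cell comes through like the others.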