Important Community update: The process for changing your account details was updated on June 25th. Learn how this impacts your Community experience and the actions we suggest you take to secure your account here.

Alteryx Designer Desktop Discussions

Find answers, ask questions, and share expertise about Alteryx Designer Desktop and Intelligence Suite.
SOLVED

Data Parsing from Column of HTML Code

AConcernedAnalyst
7 - Meteor

Hi all,

 

Hope you are well. I have a column (called regiontables) of data that appears to have HTML/XML (not entirely sure which one) of a table in each record. My goal is to extract the bolded parts and create a table from this for each record. I have mainly been trying utilize beautifulsoup within the python tool, but I am not really having much luck (code posted below).  I went straight to this approach because this is what google led me to, but I am open to any ways to solve this problem at this point. Any help would be extremely welcomed!

 

For example (dummy data):

 

record 1 for region 1 has this:

 

<TABLE cellSpacing=0 cellPadding=0 width="95%" border=1>
<TBODY>
<TR>
<TH bgColor=#cfdff4 width="11%"><FONT size=2 face=Calibri>Retailer</FONT></TH>
<TH bgColor=#cfdff4 width="11%"><FONT size=2 face=Calibri>Order No.</FONT></TH>
<TH bgColor=#cfdff4 width="11%"><FONT size=2 face=Calibri>Invoice Number</FONT></TH>
<TH bgColor=#cfdff4 width="13%"><FONT size=2 face=Calibri>Contact Date</FONT></TH>
<TH bgColor=#cfdff4 width="13%"><FONT size=2 face=Calibri>Count</FONT></TH>
<TH bgColor=#cfdff4 width="42%"><FONT size=2 face=Calibri>Item Description Description</FONT></TH></TR>
<TR>
<TD align=center><FONT size=2 face=Calibri>CA123</FONT></TD>
<TD align=center><FONT size=2 face=Calibri>12345</FONT></TD>
<TD align=center><FONT size=2 face=Calibri>147249</FONT></TD>
<TD align=center><FONT size=2 face=Calibri>20151221</FONT></TD>
<TD align=center><FONT size=2 face=Calibri>50</FONT></TD>
<TD><FONT size=2 face=Calibri>[51423] </FONT></TD></TR>
<TR>
<TD align=center><FONT size=2 face=Calibri>BB174</FONT></TD>
<TD align=center><FONT size=2 face=Calibri>47119B</FONT></TD>
<TD align=center><FONT size=2 face=Calibri>141149</FONT></TD>
<TD align=center><FONT size=2 face=Calibri>20151208</FONT></TD>
<TD align=center><FONT size=2 face=Calibri>50</FONT></TD>
<TD><FONT size=2 face=Calibri>[73892] 5th Gen &amp; Ipad Cover</FONT></TD></TR>
<TR>
<TD align=center><FONT size=2 face=Calibri>MN149</FONT></TD>
<TD align=center><FONT size=2 face=Calibri>47233A</FONT></TD>
<TD align=center><FONT size=2 face=Calibri>147209</FONT></TD>
<TD align=center><FONT size=2 face=Calibri>20151023</FONT></TD>
<TD align=center><FONT size=2 face=Calibri>50</FONT></TD>
<TD><FONT size=2 face=Calibri>[378293] Xbox Controller</FONT></TD></TR></TBODY></TABLE>

 

record 2 for region 2:

<TABLE cellSpacing=0 cellPadding=0 width="95%" border=1>
<TBODY>
<TR>
<TH bgColor=#cfdff4 width="11%"><FONT size=2 face=Calibri>Retailer</FONT></TH>
<TH bgColor=#cfdff4 width="11%"><FONT size=2 face=Calibri>Order No.</FONT></TH>
<TH bgColor=#cfdff4 width="11%"><FONT size=2 face=Calibri>Invoice Number</FONT></TH>
<TH bgColor=#cfdff4 width="13%"><FONT size=2 face=Calibri>Contact Date</FONT></TH>
<TH bgColor=#cfdff4 width="13%"><FONT size=2 face=Calibri>Count</FONT></TH>
<TH bgColor=#cfdff4 width="42%"><FONT size=2 face=Calibri>Item Description Description</FONT></TH></TR>
<TR>
<TD align=center><FONT size=2 face=Calibri>CA555</FONT></TD>
<TD align=center><FONT size=2 face=Calibri>534234</FONT></TD>
<TD align=center><FONT size=2 face=Calibri>54678</FONT></TD>
<TD align=center><FONT size=2 face=Calibri>20151221</FONT></TD>
<TD align=center><FONT size=2 face=Calibri>50</FONT></TD>
<TD><FONT size=2 face=Calibri>[2123] </FONT></TD></TR>
<TR>
<TD align=center><FONT size=2 face=Calibri>BB174</FONT></TD>
<TD align=center><FONT size=2 face=Calibri>47119B</FONT></TD>
<TD align=center><FONT size=2 face=Calibri>141149</FONT></TD>
<TD align=center><FONT size=2 face=Calibri>20151208</FONT></TD>
<TD align=center><FONT size=2 face=Calibri>50</FONT></TD>
<TD><FONT size=2 face=Calibri>[73892] 5th Gen &amp; Ipad Cover</FONT></TD></TR>
<TR>
<TD align=center><FONT size=2 face=Calibri>MN149</FONT></TD>
<TD align=center><FONT size=2 face=Calibri>47233A</FONT></TD>
<TD align=center><FONT size=2 face=Calibri>147209</FONT></TD>
<TD align=center><FONT size=2 face=Calibri>20151023</FONT></TD>
<TD align=center><FONT size=2 face=Calibri>50</FONT></TD>
<TD><FONT size=2 face=Calibri>[378293] PS5 Controller</FONT></TD></TR></TBODY></TABLE>

 

 

Python Code

 

# List all non-standard packages to be imported by your
# script here (only missing packages will be installed)
from ayx import Package
Package.installPackages(['pandas','numpy','bs4','lxml'])

from ayx import Alteryx

import pandas as pd
import numpy as np

 

df = Alteryx.read('#1')

df.shape

df.columns

desired_cols = ['REGION_TABLES]

df = df[desired_cols]

 

#Text parsing
from bs4 import BeautifulSoup as bs

soup = df.applymap(lambda df: bs(df).find('tbody'))

 

It is at this point where I get an error and this is as far as I could get

 

NameError                                 Traceback (most recent call last)
<ipython-input-2-70a343cf1c6e> in <module>
      1 #Find table
      2 
----> 3 soup = df.applymap(lambda df: bs(df).find('tbody'))
      4 
      5 print(soup)

NameError: name 'df' is not defined

 

Thank you!

8 REPLIES 8
AkimasaKajitani
17 - Castor
17 - Castor

Hi @AConcernedAnalyst ,

 

I think there is a problem at bellow point( After the REGION_TABLE there is needed the single quotation.).

desired_cols = ['REGION_TABLES]

 

But if it is the cause of this problem, Python tool shows the error at this point.

 

 

This is the workable code.

from ayx import Alteryx
import pandas as pd
import numpy as np


df = Alteryx.read("#1")
df.shape
df.columns
desired_cols = ['REGION_TABLES']
df = df[desired_cols]


#Text parsing
from bs4 import BeautifulSoup as bs

soup = df.applymap(lambda df: bs(df).find('tbody'))

 

flying008
14 - Magnetar

Hi,@AConcernedAnalyst 

 

There is a so boring way for you.

 

录制_2022_04_23_16_10_14_520.gif

AConcernedAnalyst
7 - Meteor

Hi @flying008 

 

Would you be able to share the regex that was used to parse out the details from each row?

flying008
14 - Magnetar

Hi,@AConcernedAnalyst 

 

Regex:

1- Regular Expression: (=Calibri>)(.+)(<\/FONT>)

2- Output Method: Replace

3- Replacement Text: $2

4- Don't check 'Copy unmatched text to output.'

AConcernedAnalyst
7 - Meteor

@flying008 

 

If possible, would you be able to post the WF of your solution? I am kind of lost on what the settings for some of the tools look.

flying008
14 - Magnetar

Hi, @AConcernedAnalyst 

 

Unfortunately I can't provide WF due to compliance rules.

AConcernedAnalyst
7 - Meteor

Hi @flying008 

 

Understood, no problem. 

 

I was able to sort of get to the results I am looking for (like 85% of the way there) after using your posts, but did some things slightly different from the way you did things. I wonder if I can get an extra 5-10% if I follow steps more like you did. 

 

Could you please let me know the configurations for the following tools?

- First text to columns tool ( I think it generates rows based off of line breaks \n)

- Dynamic select tool

 

 

 

 

 

flying008
14 - Magnetar

Hi,@AConcernedAnalyst 

 

please see that.

 

录制_2022_05_07_08_25_58_854.gif

Labels