Hi all,
Hope you are well. I have a column (called regiontables) of data that appears to have HTML/XML (not entirely sure which one) of a table in each record. My goal is to extract the bolded parts and create a table from this for each record. I have mainly been trying utilize beautifulsoup within the python tool, but I am not really having much luck (code posted below). I went straight to this approach because this is what google led me to, but I am open to any ways to solve this problem at this point. Any help would be extremely welcomed!
For example (dummy data):
record 1 for region 1 has this:
<TABLE cellSpacing=0 cellPadding=0 width="95%" border=1>
<TBODY>
<TR>
<TH bgColor=#cfdff4 width="11%"><FONT size=2 face=Calibri>Retailer</FONT></TH>
<TH bgColor=#cfdff4 width="11%"><FONT size=2 face=Calibri>Order No.</FONT></TH>
<TH bgColor=#cfdff4 width="11%"><FONT size=2 face=Calibri>Invoice Number</FONT></TH>
<TH bgColor=#cfdff4 width="13%"><FONT size=2 face=Calibri>Contact Date</FONT></TH>
<TH bgColor=#cfdff4 width="13%"><FONT size=2 face=Calibri>Count</FONT></TH>
<TH bgColor=#cfdff4 width="42%"><FONT size=2 face=Calibri>Item Description Description</FONT></TH></TR>
<TR>
<TD align=center><FONT size=2 face=Calibri>CA123</FONT></TD>
<TD align=center><FONT size=2 face=Calibri>12345</FONT></TD>
<TD align=center><FONT size=2 face=Calibri>147249</FONT></TD>
<TD align=center><FONT size=2 face=Calibri>20151221</FONT></TD>
<TD align=center><FONT size=2 face=Calibri>50</FONT></TD>
<TD><FONT size=2 face=Calibri>[51423] </FONT></TD></TR>
<TR>
<TD align=center><FONT size=2 face=Calibri>BB174</FONT></TD>
<TD align=center><FONT size=2 face=Calibri>47119B</FONT></TD>
<TD align=center><FONT size=2 face=Calibri>141149</FONT></TD>
<TD align=center><FONT size=2 face=Calibri>20151208</FONT></TD>
<TD align=center><FONT size=2 face=Calibri>50</FONT></TD>
<TD><FONT size=2 face=Calibri>[73892] 5th Gen & Ipad Cover</FONT></TD></TR>
<TR>
<TD align=center><FONT size=2 face=Calibri>MN149</FONT></TD>
<TD align=center><FONT size=2 face=Calibri>47233A</FONT></TD>
<TD align=center><FONT size=2 face=Calibri>147209</FONT></TD>
<TD align=center><FONT size=2 face=Calibri>20151023</FONT></TD>
<TD align=center><FONT size=2 face=Calibri>50</FONT></TD>
<TD><FONT size=2 face=Calibri>[378293] Xbox Controller</FONT></TD></TR></TBODY></TABLE>
record 2 for region 2:
<TABLE cellSpacing=0 cellPadding=0 width="95%" border=1>
<TBODY>
<TR>
<TH bgColor=#cfdff4 width="11%"><FONT size=2 face=Calibri>Retailer</FONT></TH>
<TH bgColor=#cfdff4 width="11%"><FONT size=2 face=Calibri>Order No.</FONT></TH>
<TH bgColor=#cfdff4 width="11%"><FONT size=2 face=Calibri>Invoice Number</FONT></TH>
<TH bgColor=#cfdff4 width="13%"><FONT size=2 face=Calibri>Contact Date</FONT></TH>
<TH bgColor=#cfdff4 width="13%"><FONT size=2 face=Calibri>Count</FONT></TH>
<TH bgColor=#cfdff4 width="42%"><FONT size=2 face=Calibri>Item Description Description</FONT></TH></TR>
<TR>
<TD align=center><FONT size=2 face=Calibri>CA555</FONT></TD>
<TD align=center><FONT size=2 face=Calibri>534234</FONT></TD>
<TD align=center><FONT size=2 face=Calibri>54678</FONT></TD>
<TD align=center><FONT size=2 face=Calibri>20151221</FONT></TD>
<TD align=center><FONT size=2 face=Calibri>50</FONT></TD>
<TD><FONT size=2 face=Calibri>[2123] </FONT></TD></TR>
<TR>
<TD align=center><FONT size=2 face=Calibri>BB174</FONT></TD>
<TD align=center><FONT size=2 face=Calibri>47119B</FONT></TD>
<TD align=center><FONT size=2 face=Calibri>141149</FONT></TD>
<TD align=center><FONT size=2 face=Calibri>20151208</FONT></TD>
<TD align=center><FONT size=2 face=Calibri>50</FONT></TD>
<TD><FONT size=2 face=Calibri>[73892] 5th Gen & Ipad Cover</FONT></TD></TR>
<TR>
<TD align=center><FONT size=2 face=Calibri>MN149</FONT></TD>
<TD align=center><FONT size=2 face=Calibri>47233A</FONT></TD>
<TD align=center><FONT size=2 face=Calibri>147209</FONT></TD>
<TD align=center><FONT size=2 face=Calibri>20151023</FONT></TD>
<TD align=center><FONT size=2 face=Calibri>50</FONT></TD>
<TD><FONT size=2 face=Calibri>[378293] PS5 Controller</FONT></TD></TR></TBODY></TABLE>
Python Code
# List all non-standard packages to be imported by your
# script here (only missing packages will be installed)
from ayx import Package
Package.installPackages(['pandas','numpy','bs4','lxml'])
from ayx import Alteryx
import pandas as pd
import numpy as np
df = Alteryx.read('#1')
df.shape
df.columns
desired_cols = ['REGION_TABLES]
df = df[desired_cols]
#Text parsing
from bs4 import BeautifulSoup as bs
soup = df.applymap(lambda df: bs(df).find('tbody'))
It is at this point where I get an error and this is as far as I could get
NameError Traceback (most recent call last) <ipython-input-2-70a343cf1c6e> in <module> 1 #Find table 2 ----> 3 soup = df.applymap(lambda df: bs(df).find('tbody')) 4 5 print(soup) NameError: name 'df' is not defined
Thank you!
Solved! Go to Solution.
Hi @AConcernedAnalyst ,
I think there is a problem at bellow point( After the REGION_TABLE there is needed the single quotation.).
desired_cols = ['REGION_TABLES]
But if it is the cause of this problem, Python tool shows the error at this point.
This is the workable code.
from ayx import Alteryx
import pandas as pd
import numpy as np
df = Alteryx.read("#1")
df.shape
df.columns
desired_cols = ['REGION_TABLES']
df = df[desired_cols]
#Text parsing
from bs4 import BeautifulSoup as bs
soup = df.applymap(lambda df: bs(df).find('tbody'))
Hi @flying008
Would you be able to share the regex that was used to parse out the details from each row?
Regex:
1- Regular Expression: (=Calibri>)(.+)(<\/FONT>)
2- Output Method: Replace
3- Replacement Text: $2
4- Don't check 'Copy unmatched text to output.'
If possible, would you be able to post the WF of your solution? I am kind of lost on what the settings for some of the tools look.
Hi @flying008
Understood, no problem.
I was able to sort of get to the results I am looking for (like 85% of the way there) after using your posts, but did some things slightly different from the way you did things. I wonder if I can get an extra 5-10% if I follow steps more like you did.
Could you please let me know the configurations for the following tools?
- First text to columns tool ( I think it generates rows based off of line breaks \n)
- Dynamic select tool