Parsing HTML REGEX

cmartin5
6 - Meteoroid

I am new to Alteryx and trying to figure out how to parse HTML data. I have a number of txt files with HTML in them and would like to extract information from all of the files in a directory. The structure of the HTML within each txt file looks like this:

 

"
^ class=""employee"">
<h2>

<a href=""/employee/name/bob-jackson"">bob jackson</a>
</h2>

<p>
2020 right street
<br/>Somewhere, US 30030
<br/>
(555) 555-5555 </p>
</div>
^ class=""employee"">
<h2>

<a href=""/employee/name/sal-roberts"">sal roberts</a>
</h2>

<p>
2021 right street
<br/>Somewhere, US 30030
<br/>
(555) 555-5556 </p>
</div>
"

I can extract the full name from the link with a regex like:

<a href.*?>(.*?)<\/a>
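
For readers who want to check this pattern outside Alteryx, here is a minimal Python sketch. It assumes the anchor lines look exactly like the sample above (doubled quotes included); the variable names are only for illustration.

```python
import re

# Two anchor lines copied from the sample in the post (doubled quotes kept as posted).
sample = (
    '<a href=""/employee/name/bob-jackson"">bob jackson</a>\n'
    '<a href=""/employee/name/sal-roberts"">sal roberts</a>'
)

# Same pattern as in the post: the lazy .*? stops at the first ">",
# and the capture group stops at the first "</a>".
name_pattern = re.compile(r'<a href.*?>(.*?)<\/a>')

names = name_pattern.findall(sample)
print(names)
```

The capture group holds only the link text, which is why nothing else from the block shows up with this expression alone.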

 

I am struggling to capture anything else with my expression.

 

Note:  I pulled the txt files into my workspace by doing the following: 

1. using input data tool 

2. keeping defaults except changing the delimiter to \0

 

I am not sure what the best practice is for this. Thanks for the help!

10 REPLIES

Hi @cmartin5 

 

I'm not a Regex expert, but you can use Or (the | operator) in the expression to create as many columns as you need, something like the example below.

 

Then you can use a Multi-Row Formula, which can act like Excel's auto fill, to fill the parsed values down, and then filter the data to keep only the records that have all parsed elements available.

christine_assaad_0-1658347508590.png
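
The screenshot itself is not reproduced here, so the following Python sketch is only an approximation of the idea described: one pattern with alternatives, one column per alternative, a fill-down step, and a filter. The group names, the phone pattern, and the line-by-line layout are assumptions based on the sample data in the question.

```python
import re

# The sample, one record per line, roughly as a line-by-line read would give.
lines = [
    '<a href=""/employee/name/bob-jackson"">bob jackson</a>',
    '2020 right street',
    '<br/>Somewhere, US 30030',
    '(555) 555-5555 </p>',
    '<a href=""/employee/name/sal-roberts"">sal roberts</a>',
    '2021 right street',
    '<br/>Somewhere, US 30030',
    '(555) 555-5556 </p>',
]

# Each alternative of the "Or" fills its own named group (its own column).
field_pattern = re.compile(
    r'<a href.*?>(?P<name>.*?)</a>'
    r'|^(?P<phone>\(\d{3}\) \d{3}-\d{4})'
)

rows = []
current_name = None  # value carried down, like an auto fill
for line in lines:
    match = field_pattern.search(line)
    if match is None:
        continue
    if match.group('name') is not None:
        current_name = match.group('name')
    phone = match.group('phone')
    # Filter: keep only rows where every parsed element is available.
    if current_name is not None and phone is not None:
        rows.append({'name': current_name, 'phone': phone})

print(rows)
```

Carrying `current_name` forward plays the role of the fill-down, and the final `if` plays the role of the filter.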

 

cmartin5
6 - Meteoroid

Thank you for such a quick response!  How would you pull the address, and get rid of the tags that are not needed? 

 

dougperez
12 - Quasar

Hello!

 

I used Replace in the Formula tool just to "aggregate" the information in "< class=""employee"">", and after that I used some regex to tabulate the data.

dougperez_0-1658347841361.png
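
As a rough Python analogue of this approach (not the workflow in the screenshot), the text can be split into one chunk per employee marker and then tabulated with a single regex per chunk. The marker string, the field names, and the overall pattern are assumptions based on the sample posted in the question.

```python
import re

# Sample text as posted in the question, doubled quotes included.
text = '''^ class=""employee"">
<h2>

<a href=""/employee/name/bob-jackson"">bob jackson</a>
</h2>

<p>
2020 right street
<br/>Somewhere, US 30030
<br/>
(555) 555-5555 </p>
</div>
^ class=""employee"">
<h2>

<a href=""/employee/name/sal-roberts"">sal roberts</a>
</h2>

<p>
2021 right street
<br/>Somewhere, US 30030
<br/>
(555) 555-5556 </p>
</div>
'''

# Group the text into one chunk per employee; the piece before the
# first marker is dropped.
blocks = text.split('class=""employee"">')[1:]

# One pattern per chunk; DOTALL lets "." run across line breaks.
block_pattern = re.compile(
    r'<a href.*?>(?P<name>.*?)</a>.*?<p>\s*(?P<street>.*?)\s*<br/>'
    r'(?P<city>.*?)\s*<br/>\s*(?P<phone>.*?)\s*</p>',
    re.DOTALL,
)

records = []
for block in blocks:
    match = block_pattern.search(block)
    if match is not None:
        records.append(match.groupdict())

for record in records:
    print(record)
```

Working on one chunk at a time keeps a lazy match from running into the next employee's data.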

 

cmartin5
6 - Meteoroid

Can you do the same thing with the Input Data tool connected to a txt file? The files I have are all txt files in a directory. I cannot load my txt files in their current format using the Text Input tool... This is a huge help! Really appreciate all the effort so far.

Thanks @cmartin5  for the challenge :)

 

I can use Regex to parse the second part of the address, then a Multi-Row Formula to concatenate both parts. See below.

 

Step 1:

christine_assaad_0-1658348629581.png

 

Step 2:

christine_assaad_1-1658348669258.png

 

Then you can use a Multi-Row Formula with Filter and Unique to get the result below.

christine_assaad_2-1658348714126.png
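
The address steps above can be sketched in Python as follows. This is an illustration under assumptions, not the expressions from the screenshots: it assumes the second address line always starts with `<br/>` followed by text, and that the street sits on the line directly before it.

```python
import re

# The address-related lines from the sample, in file order.
lines = [
    '<p>',
    '2020 right street',
    '<br/>Somewhere, US 30030',
    '<br/>',
    '(555) 555-5555 </p>',
    '<p>',
    '2021 right street',
    '<br/>Somewhere, US 30030',
    '<br/>',
    '(555) 555-5556 </p>',
]

# Second part of the address: "<br/>" followed by at least one character,
# so a bare "<br/>" line does not match.
second_part = re.compile(r'^<br/>(.+)$')

addresses = []
# Pair every line with the one before it, like a "previous row" reference.
for previous, current in zip(lines, lines[1:]):
    match = second_part.match(current)
    if match is not None:
        addresses.append(f'{previous.strip()}, {match.group(1).strip()}')

print(addresses)
```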

 

cmartin5
6 - Meteoroid

Looks promising!  I am trying to understand all the steps and incorporate this with my data to see if I can duplicate the effort!  Thank you for all the help!  

cmartin5
6 - Meteoroid

If you had that HTML saved to a txt file in a folder, how would you read it in? I have a number of txt files sitting in a file directory that contain data in the format I sent over. I am getting errors when trying to duplicate the solution. Getting really close! Thanks again for the help!

 

Hi @cmartin5 

 

You can use the Directory tool followed by the Dynamic Input tool.

christine_assaad_0-1658350864288.png

When you configure the template in Dynamic Input, choose "Read it as a delimited text file", then select "\0" as the delimiter.

christine_assaad_1-1658350956102.png
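
For comparison outside Alteryx, here is a minimal Python sketch of the same idea: list every txt file in a folder and read each one without splitting it on a delimiter. The function name, the UTF-8 encoding, and the demo file names are assumptions; the temporary folder only stands in for the real directory.

```python
from pathlib import Path
import tempfile


def read_txt_files(folder: Path) -> dict:
    """Map each .txt file name in `folder` to its full, unsplit contents."""
    return {
        path.name: path.read_text(encoding='utf-8')
        for path in sorted(folder.glob('*.txt'))
    }


# Demo: a temporary folder with two small files stands in for the real directory.
with tempfile.TemporaryDirectory() as tmp:
    folder = Path(tmp)
    (folder / 'a.txt').write_text('<h2>\nbob jackson\n</h2>\n', encoding='utf-8')
    (folder / 'b.txt').write_text('<h2>\nsal roberts\n</h2>\n', encoding='utf-8')
    contents = read_txt_files(folder)

for name, text in contents.items():
    print(name, len(text))
```

Each file's contents can then be passed to whichever regex is used for parsing.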

 

Hope this helps.

cmartin5
6 - Meteoroid

Awesome! I was able to figure out the directory and the regex syntax to get the columns needed... Now to figure out the Multi-Row Formula and the steps beyond it that you posted. I had to modify the syntax a good bit to work with my dataset versus the bogus data I placed in the post, but I am starting to understand everything. Thank you for all the help!
