I am new to Alteryx and trying to figure out how to parse HTML data. I have a number of txt files with HTML in them and would like to extract information from all of the files in a directory. The structure of the HTML within each txt file looks like this:
"
<div class=""employee"">
<h2>
<a href=""/employee/name/bob-jackson"">bob jackson</a>
</h2>
<p>
2020 right street
<br/>Somewhere, US 30030
<br/>
(555) 555-5555 </p>
</div>
<div class=""employee"">
<h2>
<a href=""/employee/name/sal-roberts"">sal roberts</a>
</h2>
<p>
2021 right street
<br/>Somewhere, US 30030
<br/>
(555) 555-5556 </p>
</div>
"
I can extract the href full name by adding a regex expression like:
<a href.*?>(.*?)<\/a>
I am struggling with getting anything else to show within my expression.
Note: I pulled the txt files into my workspace by doing the following:
1. using input data tool
2. keeping defaults except changing the delimiter to \0
I am not sure what the best practice is for this? Thanks for the help!
Hi @cmartin5
I'm not a Regex expert, but you can use Or (the | operator) in the expression to create as many columns as you need, something like the example below.
Then you can use the Multi-Row Formula tool, which can act like Excel's auto fill, and filter the data to keep only the record that has all parsed elements available.
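For readers who want to see the idea outside of Alteryx, here is a rough Python sketch of the same approach: one pattern with several Or branches, a fill-down, and a filter. It is not the workflow from this post. The sample lines, the group names (name, street, city, phone) and the parse function are all made up for illustration, and it assumes each line of the file arrives as its own record.

```python
import re

# Sample lines modelled on the question (hypothetical data, one record per line).
LINES = [
    '<div class="employee">',
    '<h2>',
    '<a href="/employee/name/bob-jackson">bob jackson</a>',
    '</h2>',
    '<p>',
    '2020 right street',
    '<br/>Somewhere, US 30030',
    '<br/>',
    '(555) 555-5555 </p>',
    '</div>',
]

# One branch per output column; any given line matches at most one branch,
# so the other groups come back empty for that line.
PATTERN = re.compile(
    r'<a href[^>]*>(?P<name>.*?)</a>'
    r'|^(?P<street>\d+ [^<]+)$'
    r'|^<br/>(?P<city>[^<]+)$'
    r'|^(?P<phone>\(\d{3}\) \d{3}-\d{4})'
)

def parse(lines):
    """Fill values down line by line, then keep only fully populated rows."""
    empty = {"name": None, "street": None, "city": None, "phone": None}
    current = dict(empty)
    rows = []
    for line in lines:
        match = PATTERN.search(line.strip())
        if not match:
            continue
        if match.group("name"):
            # A new name starts a new employee, so stop carrying old values.
            current = dict(empty)
        for field, value in match.groupdict().items():
            if value is not None:
                current[field] = value.strip()
        rows.append(dict(current))
    # Equivalent of the filter step: only rows where every column is filled.
    return [row for row in rows if all(row.values())]

for row in parse(LINES):
    print(row)
```

The reset on a new name matters: without it, the previous employee's street and phone would be carried into the next record and pass the filter by mistake.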
Thank you for such a quick response! How would you pull the address, and get rid of the tags that are not needed?
Can you do the same thing with the Input Data tool connected to a txt file? The files I have are all txt files in a directory, and I cannot load them in their current format using the Text Input tool... This is a huge help so far! Really appreciate all the effort.
Thanks @cmartin5 for the challenge :)
I can use Regex to parse the 2nd part of the address, then a Multi-Row Formula to concatenate both parts. See below
Step 1:
Step 2:
Then you can use the Multi-Row Formula tool together with Filter and Unique to get to the below.
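To illustrate the address part in code, here is a small Python sketch that strips the tags from one paragraph block and joins the two address lines. It is only an approximation of the steps above, not the actual workflow: the function name address_from_paragraph is hypothetical, and it assumes the last text piece in the block is always the phone number.

```python
import re

# Matches any tag such as <p>, <br/> or </p>.
TAG = re.compile(r"<[^>]+>")

def address_from_paragraph(p_html):
    """Return (address, phone) from one <p>...</p> block of the sample data."""
    # Split on every tag, then drop the empty pieces left between tags.
    parts = [piece.strip() for piece in TAG.split(p_html)]
    parts = [piece for piece in parts if piece]
    # Assumption: the final piece is the phone, everything before it is address.
    *address_lines, phone = parts
    return ", ".join(address_lines), phone

block = "<p>\n2020 right street\n<br/>Somewhere, US 30030\n<br/>\n(555) 555-5555 </p>"
address, phone = address_from_paragraph(block)
print(address)
print(phone)
```

Splitting on the tags does two jobs at once: it removes the markup that is not needed and separates the street line from the city line so they can be joined with a comma.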
Looks promising! I am trying to understand all the steps and incorporate this with my data to see if I can duplicate the effort! Thank you for all the help!
If you had that html syntax saved out to a txt file in a folder, how would you do that? I have a number of txt files sitting out in a file directory that contains data like the format that I sent over. I am getting errors when trying to duplicate the solution. Getting really close! Thanks again for the help!
Hi @cmartin5
You can use the Alteryx Directory tool followed by the Dynamic Input tool.
When you configure the template in Dynamic Input, choose "Read it as a delimited text file", then select "\0" as the delimiter.
Hope this helps.
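As a point of comparison for anyone prototyping outside Alteryx, the same "list the folder, then read each file whole" idea looks roughly like this in Python. This is a sketch under assumptions: the function name read_txt_files is made up, the files are assumed to be UTF-8, and only txt files directly inside the folder are picked up.

```python
from pathlib import Path

def read_txt_files(folder):
    """Return {file name: full text} for every .txt file directly in folder."""
    contents = {}
    for path in sorted(Path(folder).glob("*.txt")):
        # Read the whole file as one string, similar to using \0 as the
        # delimiter so the content is not split into columns.
        contents[path.name] = path.read_text(encoding="utf-8", errors="replace")
    return contents
```

Keeping the file name next to the text makes it easy to trace each parsed employee back to the file it came from.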
Awesome! I was able to figure out the directory and the regex syntax to get the columns needed... Now to figure out the Multi-Row Formula and the later steps that you posted. I had to modify the syntax a good bit to work with my dataset versus the bogus data I placed in the post, but I am starting to understand everything. Thank you for all the help!