I am new to Alteryx and trying to figure out how to parse HTML data. I have a number of txt files containing HTML and would like to extract information from every file in a directory. The structure of the HTML within each txt file looks like this:
<div class="employee">
<h2>
<a href="/employee/name/bob-jackson">bob jackson</a>
</h2>
<p>
2020 right street
<br/>Somewhere, US 30030
<br/>
(555) 555-5555 </p>
</div>
<div class="employee">
<h2>
<a href="/employee/name/sal-roberts">sal roberts</a>
</h2>
<p>
2021 right street
<br/>Somewhere, US 30030
<br/>
(555) 555-5556 </p>
</div>
I can extract the full name from the link with a regular expression like:
<a href.*?>(.*?)<\/a>
I am struggling to get my expression to capture anything else.
Note: I pulled the txt files into my workspace by doing the following:
1. Using the Input Data tool
2. Keeping the defaults, except changing the delimiter to \0
Is this the best practice for this? Thanks for the help!
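For anyone landing here with the same question, below is a minimal sketch in Python (not an Alteryx workflow) of one way to capture the other fields. It assumes each record is a `<div class="employee">` block with ordinary double quotes and exactly two `<br/>` tags inside the `<p>`, as in the sample above. The group names (`href`, `name`, `street`, `city_line`, `phone`) and the function `parse_employees` are made up for illustration. The idea is to anchor on the surrounding tags and let `\s*` absorb the line breaks; the `(?P<...>)` named-group syntax is Python's, so a regex tool elsewhere may need plain numbered groups instead.

```python
import re

# Hypothetical pattern: one match per employee block.
# Group names are illustrative and not taken from the original post.
EMPLOYEE_RE = re.compile(
    r'<div class="employee">\s*'
    r'<h2>\s*<a href="(?P<href>[^"]*)">(?P<name>.*?)</a>\s*</h2>\s*'
    r'<p>\s*(?P<street>.*?)\s*'
    r'<br/>\s*(?P<city_line>.*?)\s*'
    r'<br/>\s*(?P<phone>.*?)\s*</p>',
    re.DOTALL,  # let "." cross line breaks inside a block
)

# Sample text copied from the question, assuming plain double quotes.
SAMPLE = """<div class="employee">
<h2>
<a href="/employee/name/bob-jackson">bob jackson</a>
</h2>
<p>
2020 right street
<br/>Somewhere, US 30030
<br/>
(555) 555-5555 </p>
</div>
<div class="employee">
<h2>
<a href="/employee/name/sal-roberts">sal roberts</a>
</h2>
<p>
2021 right street
<br/>Somewhere, US 30030
<br/>
(555) 555-5556 </p>
</div>
"""


def parse_employees(html):
    """Return one dict of captured fields per employee block found in html."""
    return [match.groupdict() for match in EMPLOYEE_RE.finditer(html)]


if __name__ == "__main__":
    for row in parse_employees(SAMPLE):
        print(row)
```

Reading each file as one string (which is roughly what the \0 delimiter achieves) matters here, because the pattern spans several lines. Regular expressions are fragile against HTML that varies in structure, so this only holds up while the files keep the layout shown in the sample.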
Thanks again, I was able to get this sorted just like I wanted. Really appreciate your example, that helped me figure everything out.