Alteryx Designer Desktop Discussions

hs14428 · 10-23-2019

Hi Alteryx Community,

To give some context, I am trying to extract a number of strings (around 10 currently) from a long webpage's HTML within Alteryx. A colleague of mine currently has a Python web scraper in place, but I am exploring implementing the web scraper within Alteryx. I am completely new to regular expressions and have been trying to teach myself how they work. Using the Community and Google, I have come up with an expression that isolates what I want, but only for the first place where the HTML string meets the expression criteria.

Please could someone assist me in repeating the RegExOut to include all the times the string matches the RegEx?

The HTML has the following format:

<div id="total-display-traded" class="total-display">
<p class="spacer-top">Most traded funds (total buys and sells) from the past week:</p>
<div class="row"><div class="large-6 medium-6 columns"><ul class="list-standard-styled list-blue-dot li-no-indent li-spacer-dbl">
<li><a href="https://www.fakeurl.co.uk/search-results/B41YBW7" title="Important Information That I Require" class="link-headline">Pretty much the same important information I require </a></li>

My RegEx is: .*<a href="(.*?)"\s*\.*title="(.*?)" and I am using the Parse output method. I understand that the Tokenize output method might be what I need, but as a noob, I haven't managed to do this yet.

In the HTML, the first string, which is in bold and italics, is a unique locator for where the required information lies. This exact line appears 4 times in the entire HTML code and is what my colleague's Python web scraper uses to find the required information.

The second string in bold and italics is the information that I require. My RegEx gives me the URL just before the required information in one column, and the required title information in the next column. The issue is repeating the RegEx output to capture every match, because the expression sometimes captures a link and a title before the required information is reached in the HTML. Once I have every match, I can more easily work with the output and filter out the 10 or so strings I require. Currently the RegEx captures the first link and title, which I am not interested in, and stops there.

If it is any help, the HTML below is the last segment that information is taken from. I have highlighted the last line; anything beyond this point is of no use. Everything between the first HTML block and the HTML block below contains the required information (in the same format).

<li><a href="https://fakeurl.co.uk/search-results/BJBQC36" title="Important Information That I Require" class="link-headline">Pretty much the same important information I require </a></li>
</div></ul></div></div>
<div id="total-display-trackers" class="total-display">

Hopefully this is clear, and any help/pointers will be greatly appreciated!

Many thanks,

Harry

T_Willins · 10-23-2019

Hi @hs14428

Web scraping can be tough. If you get a chance to take web scraping at #Inspire, I would highly recommend it. Looking at what you are trying to get to, you might first try a RegEx tool set to Tokenize with <li>.*?</li> so that each list item becomes its own token. Then add another RegEx tool and parse the results of the first RegEx tool using your expression .*<a href="(.*?)"\s*\.*title="(.*?)" in Parse mode. Without seeing your actual results I can't be sure, but your expression may be too greedy between your first and second marked groups.
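
For anyone who wants to check the two-step idea outside Alteryx, here is a rough sketch in Python, since the thread mentions an existing Python scraper. It is only an approximation: Python's re module is not the engine behind the Alteryx RegEx tool, so treat it as a sanity check of the patterns rather than a guarantee of identical behaviour. The sample HTML is pieced together from the snippets in the question, and the names LI_PATTERN, LINK_PATTERN and extract_links are made up for this illustration.

```python
import re

# Sample HTML assembled from the snippets in the question (fake URLs from the post).
html = '''<div id="total-display-traded" class="total-display">
<p class="spacer-top">Most traded funds (total buys and sells) from the past week:</p>
<div class="row"><div class="large-6 medium-6 columns"><ul class="list-standard-styled list-blue-dot li-no-indent li-spacer-dbl">
<li><a href="https://www.fakeurl.co.uk/search-results/B41YBW7" title="Important Information That I Require" class="link-headline">Pretty much the same important information I require </a></li>
<li><a href="https://fakeurl.co.uk/search-results/BJBQC36" title="Important Information That I Require" class="link-headline">Pretty much the same important information I require </a></li>
</div></ul></div></div>
<div id="total-display-trackers" class="total-display">'''

# Step 1 (the "Tokenize" idea): one token per <li>...</li> element.
# The lazy .*? stops at the first closing </li> instead of the last one.
LI_PATTERN = re.compile(r'<li>.*?</li>', re.DOTALL)

# Step 2 (the "Parse" idea): the expression from the question, applied per token.
LINK_PATTERN = re.compile(r'.*<a href="(.*?)"\s*\.*title="(.*?)"')


def extract_links(page):
    """Return a list of (url, title) tuples, one per <li> that contains a link."""
    rows = []
    for token in LI_PATTERN.findall(page):
        match = LINK_PATTERN.search(token)
        if match:  # skip list items without an <a href=... title=...> pair
            rows.append((match.group(1), match.group(2)))
    return rows


links = extract_links(html)
for url, title in links:
    print(url, "|", title)
```

Because each token holds a single list item, the parse expression can only match once per token, which is what turns the single match from the original setup into one row per link.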