Alteryx designer Discussions

Find answers, ask questions, and share expertise about Alteryx Designer.
Alteryx is here to help you solve your biggest data challenges. Read about the new Virtual Solution Center here.
SOLVED

Web scraping/regex

Highlighted
6 - Meteoroid

Hi all, 

 

I watched a webinar on this and thought it would be easy... I'm trying to download some data that is held in a table on a website. I have got as far as downloading the data, but am stuck on how to extract what I want out of that data. The source code looks like: 

<div>
 <table cellspacing="0" border="0" id="ctl00_MainContent_ucFinancialComplianceTable_gdvFinancialAndComplianceHistory" style="border-width:0px;border-style:None;border-collapse:collapse;">
 <tr>
 <th class="GridHeading" scope="col" style="width:146px;">Financial year end (FYE)</th><th class="GridHeading" scope="col" style="width:73px;">Income</th><th class="GridHeading" scope="col" style="width:73px;">Spending</th><th class="GridHeading" scope="col" style="width:146px;">Accounts received</th><th class="GridHeading" scope="col" style="width:146px;">Annual Return received</th><th class="GridHeading" scope="col" style="width:73px;">View</th>
 </tr><tr style="background-color:#FFFF99;">
 <td class="GridColumnText" style="width:146px;">31 Aug 2018</td><td class="GridColumnNumeric" style="width:73px;">&#163;1,313,932</td><td class="GridColumnNumeric" style="width:73px;">&#163;1,312,739</td><td class="GridColumnText" style="width:146px;">30 Jun 2019</td><td class="GridColumnText" style="width:146px;">30 Jun 2019</td><td class="GridColumnText" style="width:73px;"><a title='31 Aug 2018 Accounts PDF' target='_blank' href='/Accounts/Ends74\0000267174_AC_20180831_E_C.pdf'>Accounts</a></td>
 </tr><tr style="background-color:#FFFFCC;">
 <td class="GridColumnText" style="width:146px;">31 Aug 2017</td><td class="GridColumnNumeric" style="width:73px;">&#163;1,166,743</td><td class="GridColumnNumeric" style="width:73px;">&#163;1,127,987</td><td class="GridColumnText" style="width:146px;">05 Jun 2018</td><td class="GridColumnText" style="width:146px;">05 Jun 2018</td><td class="GridColumnText" style="width:73px;"><a title='31 Aug 2017 Accounts PDF' target='_blank' href='/Accounts/Ends74\0000267174_AC_20170831_E_C.pdf'>Accounts</a></td>
 </tr><tr style="background-color:#FFFF99;">
 <td class="GridColumnText" style="width:146px;">31 Aug 2016</td><td class="GridColumnNumeric" style="width:73px;">&#163;1,169,515</td><td class="GridColumnNumeric" style="width:73px;">&#163;1,159,754</td><td class="GridColumnText" style="width:146px;">13 Jun 2017*</td><td class="GridColumnText" style="width:146px;">13 Jun 2017</td><td class="GridColumnText" style="width:73px;"><a title='31 Aug 2016 Accounts PDF' target='_blank' href='/Accounts/Ends74\0000267174_AC_20160831_E_C.pdf'>Accounts</a></td>
 </tr><tr style="background-color:#FFFFCC;">
 <td class="GridColumnText" style="width:146px;">31 Aug 2015</td><td class="GridColumnNumeric" style="width:73px;">&#163;1,193,076</td><td class="GridColumnNumeric" style="width:73px;">&#163;1,112,737</td><td class="GridColumnText" style="width:146px;">12 May 2016</td><td class="GridColumnText" style="width:146px;">26 May 2016</td><td class="GridColumnText" style="width:73px;"><a title='31 Aug 2015 Accounts PDF' target='_blank' href='/Accounts/Ends74\0000267174_AC_20150831_E_C.pdf'>Accounts</a></td>
 </tr><tr style="background-color:#FFFF99;">
 <td class="GridColumnText" style="width:146px;">31 Aug 2014</td><td class="GridColumnNumeric" style="width:73px;">&#163;1,129,065</td><td class="GridColumnNumeric" style="width:73px;">&#163;1,021,361</td><td class="GridColumnText" style="width:146px;">06 Mar 2015</td><td class="GridColumnText" style="width:146px;">08 Jun 2015</td><td class="GridColumnText" style="width:73px;"><a title='31 Aug 2014 Accounts PDF' target='_blank' href='/Accounts/Ends74\0000267174_AC_20140831_E_C.pdf'>Accounts</a></td>
 </tr>
 </table>
 </div

 

I want to extract the data in bold. they are DATE, INCOME and EXPENDITURE over 5 years. I've been trying to build a regular expression but to avail. Similarly, I've downloaded the data to a temp file and if I try and filter it does it on the tempfile name, not the content... Feel I'm a little out of my depth. Any pointers?

Highlighted
Alteryx Certified Partner

Hey @MTMjames,

 

Any chance you could like the website you're trying to scrape?


Thanks,

Jordan

14 - Magnetar

I modified the workflow found in this post slightly based on the data you provided and it came up with the results below.

 

https://community.alteryx.com/t5/Alteryx-Designer-Discussions/Parsing-HTML-tables/td-p/25886

 

DavidP_0-1585008480310.png

 

Highlighted
6 - Meteoroid

The website is  https://apps.charitycommission.gov.uk/Showcharity/RegisterOfCharities/FinancialHistory.aspx?Register...

 

I have around 400 of these where the number is difference else the URL is the same. That bit I have sorted and all works, just not how to grab the data our. I have attached it here... 

Highlighted
14 - Magnetar

Also look at the solutions to weekly challenge 13 and 40 for more ideas on parsing html

Highlighted
6 - Meteoroid

That is brilliant, thanks. I'm new to the community and should have explained better! How do I plug that in to the rest of the flow?

Highlighted
14 - Magnetar

Ok, no worries. Leave it with me and I'll work on it this morning. Now that I can see what the data looks like, perhaps we can do something better.

Highlighted
14 - Magnetar

Here's a quick and dirty version for you to play with. I'll see if I can get the data out for you in a more elegant representation.

 

DavidP_0-1585046999447.png

 

Highlighted
14 - Magnetar

This is slightly cleaner I think. Since we're only looking to extract the Table data, I went looking for those rows and filtered the rest out early on. Also replaced &#163 (£ ascii code) with £ sign in the amounts. I added Charity No to show what each table relates to.

 

Let me know if you have any other questions

 

 

DavidP_0-1585049661122.png

 

Highlighted
6 - Meteoroid

That is totally amazing, thank you!!

Labels