Community Spring Cleaning week is here! Join your fellow Maveryx in digging through your old posts and marking comments on them as solved. Learn more here!

Alteryx Designer Desktop Discussions

Find answers, ask questions, and share expertise about Alteryx Designer Desktop and Intelligence Suite.
SOLVED

Web scraping/regex

MTMjames
6 - Meteoroid

Hi all, 

 

I watched a webinar on this and thought it would be easy... I'm trying to download some data that is held in a table on a website. I have got as far as downloading the data, but am stuck on how to extract what I want out of that data. The source code looks like: 

<div>
 <table cellspacing="0" border="0" id="ctl00_MainContent_ucFinancialComplianceTable_gdvFinancialAndComplianceHistory" style="border-width:0px;border-style:None;border-collapse:collapse;">
 <tr>
 <th class="GridHeading" scope="col" style="width:146px;">Financial year end (FYE)</th><th class="GridHeading" scope="col" style="width:73px;">Income</th><th class="GridHeading" scope="col" style="width:73px;">Spending</th><th class="GridHeading" scope="col" style="width:146px;">Accounts received</th><th class="GridHeading" scope="col" style="width:146px;">Annual Return received</th><th class="GridHeading" scope="col" style="width:73px;">View</th>
 </tr><tr style="background-color:#FFFF99;">
 <td class="GridColumnText" style="width:146px;">31 Aug 2018</td><td class="GridColumnNumeric" style="width:73px;">&#163;1,313,932</td><td class="GridColumnNumeric" style="width:73px;">&#163;1,312,739</td><td class="GridColumnText" style="width:146px;">30 Jun 2019</td><td class="GridColumnText" style="width:146px;">30 Jun 2019</td><td class="GridColumnText" style="width:73px;"><a title='31 Aug 2018 Accounts PDF' target='_blank' href='/Accounts/Ends74\0000267174_AC_20180831_E_C.pdf'>Accounts</a></td>
 </tr><tr style="background-color:#FFFFCC;">
 <td class="GridColumnText" style="width:146px;">31 Aug 2017</td><td class="GridColumnNumeric" style="width:73px;">&#163;1,166,743</td><td class="GridColumnNumeric" style="width:73px;">&#163;1,127,987</td><td class="GridColumnText" style="width:146px;">05 Jun 2018</td><td class="GridColumnText" style="width:146px;">05 Jun 2018</td><td class="GridColumnText" style="width:73px;"><a title='31 Aug 2017 Accounts PDF' target='_blank' href='/Accounts/Ends74\0000267174_AC_20170831_E_C.pdf'>Accounts</a></td>
 </tr><tr style="background-color:#FFFF99;">
 <td class="GridColumnText" style="width:146px;">31 Aug 2016</td><td class="GridColumnNumeric" style="width:73px;">&#163;1,169,515</td><td class="GridColumnNumeric" style="width:73px;">&#163;1,159,754</td><td class="GridColumnText" style="width:146px;">13 Jun 2017*</td><td class="GridColumnText" style="width:146px;">13 Jun 2017</td><td class="GridColumnText" style="width:73px;"><a title='31 Aug 2016 Accounts PDF' target='_blank' href='/Accounts/Ends74\0000267174_AC_20160831_E_C.pdf'>Accounts</a></td>
 </tr><tr style="background-color:#FFFFCC;">
 <td class="GridColumnText" style="width:146px;">31 Aug 2015</td><td class="GridColumnNumeric" style="width:73px;">&#163;1,193,076</td><td class="GridColumnNumeric" style="width:73px;">&#163;1,112,737</td><td class="GridColumnText" style="width:146px;">12 May 2016</td><td class="GridColumnText" style="width:146px;">26 May 2016</td><td class="GridColumnText" style="width:73px;"><a title='31 Aug 2015 Accounts PDF' target='_blank' href='/Accounts/Ends74\0000267174_AC_20150831_E_C.pdf'>Accounts</a></td>
 </tr><tr style="background-color:#FFFF99;">
 <td class="GridColumnText" style="width:146px;">31 Aug 2014</td><td class="GridColumnNumeric" style="width:73px;">&#163;1,129,065</td><td class="GridColumnNumeric" style="width:73px;">&#163;1,021,361</td><td class="GridColumnText" style="width:146px;">06 Mar 2015</td><td class="GridColumnText" style="width:146px;">08 Jun 2015</td><td class="GridColumnText" style="width:73px;"><a title='31 Aug 2014 Accounts PDF' target='_blank' href='/Accounts/Ends74\0000267174_AC_20140831_E_C.pdf'>Accounts</a></td>
 </tr>
 </table>
 </div

 

I want to extract the data in bold. they are DATE, INCOME and EXPENDITURE over 5 years. I've been trying to build a regular expression but to avail. Similarly, I've downloaded the data to a temp file and if I try and filter it does it on the tempfile name, not the content... Feel I'm a little out of my depth. Any pointers?

13 REPLIES 13
JordyMicheal
11 - Bolide

Hey @MTMjames,

 

Any chance you could like the website you're trying to scrape?


Thanks,

Jordan

DavidP
17 - Castor
17 - Castor

I modified the workflow found in this post slightly based on the data you provided and it came up with the results below.

 

https://community.alteryx.com/t5/Alteryx-Designer-Discussions/Parsing-HTML-tables/td-p/25886

 

DavidP_0-1585008480310.png

 

MTMjames
6 - Meteoroid

The website is  https://apps.charitycommission.gov.uk/Showcharity/RegisterOfCharities/FinancialHistory.aspx?Register...

 

I have around 400 of these where the number is difference else the URL is the same. That bit I have sorted and all works, just not how to grab the data our. I have attached it here... 

DavidP
17 - Castor
17 - Castor

Also look at the solutions to weekly challenge 13 and 40 for more ideas on parsing html

MTMjames
6 - Meteoroid

That is brilliant, thanks. I'm new to the community and should have explained better! How do I plug that in to the rest of the flow?

DavidP
17 - Castor
17 - Castor

Ok, no worries. Leave it with me and I'll work on it this morning. Now that I can see what the data looks like, perhaps we can do something better.

DavidP
17 - Castor
17 - Castor

Here's a quick and dirty version for you to play with. I'll see if I can get the data out for you in a more elegant representation.

 

DavidP_0-1585046999447.png

 

DavidP
17 - Castor
17 - Castor

This is slightly cleaner I think. Since we're only looking to extract the Table data, I went looking for those rows and filtered the rest out early on. Also replaced &#163 (£ ascii code) with £ sign in the amounts. I added Charity No to show what each table relates to.

 

Let me know if you have any other questions

 

 

DavidP_0-1585049661122.png

 

MTMjames
6 - Meteoroid

That is totally amazing, thank you!!

Labels