Alteryx Designer Desktop Discussions

Find answers, ask questions, and share expertise about Alteryx Designer Desktop and Intelligence Suite.
SOLVED

Web scraping/regex

MTMjames
6 - Meteoroid

Hi all, 

 

I watched a webinar on this and thought it would be easy... I'm trying to download some data that is held in a table on a website. I have got as far as downloading the data, but am stuck on how to extract what I want out of that data. The source code looks like: 

<div>
 <table cellspacing="0" border="0" id="ctl00_MainContent_ucFinancialComplianceTable_gdvFinancialAndComplianceHistory" style="border-width:0px;border-style:None;border-collapse:collapse;">
 <tr>
 <th class="GridHeading" scope="col" style="width:146px;">Financial year end (FYE)</th><th class="GridHeading" scope="col" style="width:73px;">Income</th><th class="GridHeading" scope="col" style="width:73px;">Spending</th><th class="GridHeading" scope="col" style="width:146px;">Accounts received</th><th class="GridHeading" scope="col" style="width:146px;">Annual Return received</th><th class="GridHeading" scope="col" style="width:73px;">View</th>
 </tr><tr style="background-color:#FFFF99;">
 <td class="GridColumnText" style="width:146px;">31 Aug 2018</td><td class="GridColumnNumeric" style="width:73px;">&#163;1,313,932</td><td class="GridColumnNumeric" style="width:73px;">&#163;1,312,739</td><td class="GridColumnText" style="width:146px;">30 Jun 2019</td><td class="GridColumnText" style="width:146px;">30 Jun 2019</td><td class="GridColumnText" style="width:73px;"><a title='31 Aug 2018 Accounts PDF' target='_blank' href='/Accounts/Ends74\0000267174_AC_20180831_E_C.pdf'>Accounts</a></td>
 </tr><tr style="background-color:#FFFFCC;">
 <td class="GridColumnText" style="width:146px;">31 Aug 2017</td><td class="GridColumnNumeric" style="width:73px;">&#163;1,166,743</td><td class="GridColumnNumeric" style="width:73px;">&#163;1,127,987</td><td class="GridColumnText" style="width:146px;">05 Jun 2018</td><td class="GridColumnText" style="width:146px;">05 Jun 2018</td><td class="GridColumnText" style="width:73px;"><a title='31 Aug 2017 Accounts PDF' target='_blank' href='/Accounts/Ends74\0000267174_AC_20170831_E_C.pdf'>Accounts</a></td>
 </tr><tr style="background-color:#FFFF99;">
 <td class="GridColumnText" style="width:146px;">31 Aug 2016</td><td class="GridColumnNumeric" style="width:73px;">&#163;1,169,515</td><td class="GridColumnNumeric" style="width:73px;">&#163;1,159,754</td><td class="GridColumnText" style="width:146px;">13 Jun 2017*</td><td class="GridColumnText" style="width:146px;">13 Jun 2017</td><td class="GridColumnText" style="width:73px;"><a title='31 Aug 2016 Accounts PDF' target='_blank' href='/Accounts/Ends74\0000267174_AC_20160831_E_C.pdf'>Accounts</a></td>
 </tr><tr style="background-color:#FFFFCC;">
 <td class="GridColumnText" style="width:146px;">31 Aug 2015</td><td class="GridColumnNumeric" style="width:73px;">&#163;1,193,076</td><td class="GridColumnNumeric" style="width:73px;">&#163;1,112,737</td><td class="GridColumnText" style="width:146px;">12 May 2016</td><td class="GridColumnText" style="width:146px;">26 May 2016</td><td class="GridColumnText" style="width:73px;"><a title='31 Aug 2015 Accounts PDF' target='_blank' href='/Accounts/Ends74\0000267174_AC_20150831_E_C.pdf'>Accounts</a></td>
 </tr><tr style="background-color:#FFFF99;">
 <td class="GridColumnText" style="width:146px;">31 Aug 2014</td><td class="GridColumnNumeric" style="width:73px;">&#163;1,129,065</td><td class="GridColumnNumeric" style="width:73px;">&#163;1,021,361</td><td class="GridColumnText" style="width:146px;">06 Mar 2015</td><td class="GridColumnText" style="width:146px;">08 Jun 2015</td><td class="GridColumnText" style="width:73px;"><a title='31 Aug 2014 Accounts PDF' target='_blank' href='/Accounts/Ends74\0000267174_AC_20140831_E_C.pdf'>Accounts</a></td>
 </tr>
 </table>
 </div

 

I want to extract the data in bold. they are DATE, INCOME and EXPENDITURE over 5 years. I've been trying to build a regular expression but to avail. Similarly, I've downloaded the data to a temp file and if I try and filter it does it on the tempfile name, not the content... Feel I'm a little out of my depth. Any pointers?

13 REPLIES 13
MTMjames
6 - Meteoroid

I must admit I'm totally impressed with that but can't work out how you did it! I'm trying to get some more data from the same site... how would I adapt that flow to pull the postcode and the name of the Chair?

 

In this page: 

https://apps.charitycommission.gov.uk/Showcharity/RegisterOfCharities/ContactAndTrustees.aspx?Regist...

 

I'm looking to pull SP10 3AL and CAROLE MACHIN. The email of sbench@rookwoodschool.org would be helpful too, but I guess that breaks some rules. 

 

Thanks again

DavidP
17 - Castor
17 - Castor

To make the workflow a bit easier to understand, I've added some documentation to the original workflow. I've also duplicated the workflow to show how to retrieve the address and trustees info from the new URL

 

Hope you find this helpful.

 

DavidP_0-1585177814701.png

 

MTMjames
6 - Meteoroid

Ah, I think I understand it more now. Could you put each of the outputs per URN into its own Column? so the postcode and Key person have their own columns? That will help me reference them against other data. Hope thats ok! I've been working with

 

267174https://apps.charitycommission.gov.uk/Showcharity/RegisterOfCharities/FinancialHistory.aspx?Register...
268482https://apps.charitycommission.gov.uk/Showcharity/RegisterOfCharities/FinancialHistory.aspx?Register...

 

Thanks!

DavidP
17 - Castor
17 - Castor

Firstly, let's make the input a bit more dynamic so that you just enter the URN and Charity No and the URL is built dynamically.

 

Then you have isolate the postcode row and the last row for each set and join them to the 1st data set, like this

 

DavidP_0-1585213699368.png

 

Labels