Hi all,
I watched a webinar on this and thought it would be easy... I'm trying to download some data that is held in a table on a website. I have got as far as downloading the data, but am stuck on how to extract what I want out of that data. The source code looks like:
<div> | |
<table cellspacing="0" border="0" id="ctl00_MainContent_ucFinancialComplianceTable_gdvFinancialAndComplianceHistory" style="border-width:0px;border-style:None;border-collapse:collapse;"> | |
<tr> | |
<th class="GridHeading" scope="col" style="width:146px;">Financial year end (FYE)</th><th class="GridHeading" scope="col" style="width:73px;">Income</th><th class="GridHeading" scope="col" style="width:73px;">Spending</th><th class="GridHeading" scope="col" style="width:146px;">Accounts received</th><th class="GridHeading" scope="col" style="width:146px;">Annual Return received</th><th class="GridHeading" scope="col" style="width:73px;">View</th> | |
</tr><tr style="background-color:#FFFF99;"> | |
<td class="GridColumnText" style="width:146px;">31 Aug 2018</td><td class="GridColumnNumeric" style="width:73px;">£1,313,932</td><td class="GridColumnNumeric" style="width:73px;">£1,312,739</td><td class="GridColumnText" style="width:146px;">30 Jun 2019</td><td class="GridColumnText" style="width:146px;">30 Jun 2019</td><td class="GridColumnText" style="width:73px;"><a title='31 Aug 2018 Accounts PDF' target='_blank' href='/Accounts/Ends74\0000267174_AC_20180831_E_C.pdf'>Accounts</a></td> | |
</tr><tr style="background-color:#FFFFCC;"> | |
<td class="GridColumnText" style="width:146px;">31 Aug 2017</td><td class="GridColumnNumeric" style="width:73px;">£1,166,743</td><td class="GridColumnNumeric" style="width:73px;">£1,127,987</td><td class="GridColumnText" style="width:146px;">05 Jun 2018</td><td class="GridColumnText" style="width:146px;">05 Jun 2018</td><td class="GridColumnText" style="width:73px;"><a title='31 Aug 2017 Accounts PDF' target='_blank' href='/Accounts/Ends74\0000267174_AC_20170831_E_C.pdf'>Accounts</a></td> | |
</tr><tr style="background-color:#FFFF99;"> | |
<td class="GridColumnText" style="width:146px;">31 Aug 2016</td><td class="GridColumnNumeric" style="width:73px;">£1,169,515</td><td class="GridColumnNumeric" style="width:73px;">£1,159,754</td><td class="GridColumnText" style="width:146px;">13 Jun 2017*</td><td class="GridColumnText" style="width:146px;">13 Jun 2017</td><td class="GridColumnText" style="width:73px;"><a title='31 Aug 2016 Accounts PDF' target='_blank' href='/Accounts/Ends74\0000267174_AC_20160831_E_C.pdf'>Accounts</a></td> | |
</tr><tr style="background-color:#FFFFCC;"> | |
<td class="GridColumnText" style="width:146px;">31 Aug 2015</td><td class="GridColumnNumeric" style="width:73px;">£1,193,076</td><td class="GridColumnNumeric" style="width:73px;">£1,112,737</td><td class="GridColumnText" style="width:146px;">12 May 2016</td><td class="GridColumnText" style="width:146px;">26 May 2016</td><td class="GridColumnText" style="width:73px;"><a title='31 Aug 2015 Accounts PDF' target='_blank' href='/Accounts/Ends74\0000267174_AC_20150831_E_C.pdf'>Accounts</a></td> | |
</tr><tr style="background-color:#FFFF99;"> | |
<td class="GridColumnText" style="width:146px;">31 Aug 2014</td><td class="GridColumnNumeric" style="width:73px;">£1,129,065</td><td class="GridColumnNumeric" style="width:73px;">£1,021,361</td><td class="GridColumnText" style="width:146px;">06 Mar 2015</td><td class="GridColumnText" style="width:146px;">08 Jun 2015</td><td class="GridColumnText" style="width:73px;"><a title='31 Aug 2014 Accounts PDF' target='_blank' href='/Accounts/Ends74\0000267174_AC_20140831_E_C.pdf'>Accounts</a></td> | |
</tr> | |
</table> | |
</div |
I want to extract the data in bold. they are DATE, INCOME and EXPENDITURE over 5 years. I've been trying to build a regular expression but to avail. Similarly, I've downloaded the data to a temp file and if I try and filter it does it on the tempfile name, not the content... Feel I'm a little out of my depth. Any pointers?
Solved! Go to Solution.
I modified the workflow found in this post slightly based on the data you provided and it came up with the results below.
https://community.alteryx.com/t5/Alteryx-Designer-Discussions/Parsing-HTML-tables/td-p/25886
The website is https://apps.charitycommission.gov.uk/Showcharity/RegisterOfCharities/FinancialHistory.aspx?Register...
I have around 400 of these where the number is difference else the URL is the same. That bit I have sorted and all works, just not how to grab the data our. I have attached it here...
Also look at the solutions to weekly challenge 13 and 40 for more ideas on parsing html
Ok, no worries. Leave it with me and I'll work on it this morning. Now that I can see what the data looks like, perhaps we can do something better.
This is slightly cleaner I think. Since we're only looking to extract the Table data, I went looking for those rows and filtered the rest out early on. Also replaced £ (£ ascii code) with £ sign in the amounts. I added Charity No to show what each table relates to.
Let me know if you have any other questions
That is totally amazing, thank you!!