Hi all,
I watched a webinar on this and thought it would be easy... I'm trying to download some data that is held in a table on a website. I have got as far as downloading the data, but am stuck on how to extract what I want out of that data. The source code looks like:
| <div> |
| | <table cellspacing="0" border="0" id="ctl00_MainContent_ucFinancialComplianceTable_gdvFinancialAndComplianceHistory" style="border-width:0px;border-style:None;border-collapse:collapse;"> |
| | <tr> |
| | <th class="GridHeading" scope="col" style="width:146px;">Financial year end (FYE)</th><th class="GridHeading" scope="col" style="width:73px;">Income</th><th class="GridHeading" scope="col" style="width:73px;">Spending</th><th class="GridHeading" scope="col" style="width:146px;">Accounts received</th><th class="GridHeading" scope="col" style="width:146px;">Annual Return received</th><th class="GridHeading" scope="col" style="width:73px;">View</th> |
| | </tr><tr style="background-color:#FFFF99;"> |
| | <td class="GridColumnText" style="width:146px;">31 Aug 2018</td><td class="GridColumnNumeric" style="width:73px;">£1,313,932</td><td class="GridColumnNumeric" style="width:73px;">£1,312,739</td><td class="GridColumnText" style="width:146px;">30 Jun 2019</td><td class="GridColumnText" style="width:146px;">30 Jun 2019</td><td class="GridColumnText" style="width:73px;"><a title='31 Aug 2018 Accounts PDF' target='_blank' href='/Accounts/Ends74\0000267174_AC_20180831_E_C.pdf'>Accounts</a></td> |
| | </tr><tr style="background-color:#FFFFCC;"> |
| | <td class="GridColumnText" style="width:146px;">31 Aug 2017</td><td class="GridColumnNumeric" style="width:73px;">£1,166,743</td><td class="GridColumnNumeric" style="width:73px;">£1,127,987</td><td class="GridColumnText" style="width:146px;">05 Jun 2018</td><td class="GridColumnText" style="width:146px;">05 Jun 2018</td><td class="GridColumnText" style="width:73px;"><a title='31 Aug 2017 Accounts PDF' target='_blank' href='/Accounts/Ends74\0000267174_AC_20170831_E_C.pdf'>Accounts</a></td> |
| | </tr><tr style="background-color:#FFFF99;"> |
| | <td class="GridColumnText" style="width:146px;">31 Aug 2016</td><td class="GridColumnNumeric" style="width:73px;">£1,169,515</td><td class="GridColumnNumeric" style="width:73px;">£1,159,754</td><td class="GridColumnText" style="width:146px;">13 Jun 2017*</td><td class="GridColumnText" style="width:146px;">13 Jun 2017</td><td class="GridColumnText" style="width:73px;"><a title='31 Aug 2016 Accounts PDF' target='_blank' href='/Accounts/Ends74\0000267174_AC_20160831_E_C.pdf'>Accounts</a></td> |
| | </tr><tr style="background-color:#FFFFCC;"> |
| | <td class="GridColumnText" style="width:146px;">31 Aug 2015</td><td class="GridColumnNumeric" style="width:73px;">£1,193,076</td><td class="GridColumnNumeric" style="width:73px;">£1,112,737</td><td class="GridColumnText" style="width:146px;">12 May 2016</td><td class="GridColumnText" style="width:146px;">26 May 2016</td><td class="GridColumnText" style="width:73px;"><a title='31 Aug 2015 Accounts PDF' target='_blank' href='/Accounts/Ends74\0000267174_AC_20150831_E_C.pdf'>Accounts</a></td> |
| | </tr><tr style="background-color:#FFFF99;"> |
| | <td class="GridColumnText" style="width:146px;">31 Aug 2014</td><td class="GridColumnNumeric" style="width:73px;">£1,129,065</td><td class="GridColumnNumeric" style="width:73px;">£1,021,361</td><td class="GridColumnText" style="width:146px;">06 Mar 2015</td><td class="GridColumnText" style="width:146px;">08 Jun 2015</td><td class="GridColumnText" style="width:73px;"><a title='31 Aug 2014 Accounts PDF' target='_blank' href='/Accounts/Ends74\0000267174_AC_20140831_E_C.pdf'>Accounts</a></td> |
| | </tr> |
| | </table> |
| | </div |
I want to extract the data in bold. they are DATE, INCOME and EXPENDITURE over 5 years. I've been trying to build a regular expression but to avail. Similarly, I've downloaded the data to a temp file and if I try and filter it does it on the tempfile name, not the content... Feel I'm a little out of my depth. Any pointers?