Hello,
I would like to scrape a single row of data from the table at https://www.treasurydirect.gov/instit/annceresult/annceresult.htm (the first 26-week row). I tried using the Download tool but was not able to figure out how to get to the table.
Hi @msve
Check out this post from The Data School. It contains step-by-step instructions on how to download, find and extract table data from an HTML page.
Dan
Hi @danilang
I was not able to find a <table> component in the webpage. Any comments on what's happening and how we can approach it?
Thanks for sharing this post @danilang! Definitely going to bookmark that one for future use.
@atcodedog05 - the <table> components are there, they are just buried within a <div> tag.
Hi @Maskell_Rascal and @Luke_C
In Inspect you will definitely find <table>. You need to check View Page Source instead, because that is what the Download tool actually receives. You can also check the source code by saving the webpage and opening the HTML in Notepad. 😅
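For anyone who wants to run the same check outside a browser, here is a minimal Python sketch (not part of any Alteryx workflow) that counts how many times a tag appears in raw HTML text. The helper name count_tag and the sample strings are made up for illustration. You would pass it the saved page source, i.e. what the Download tool receives, not the rendered DOM that Inspect shows.

```python
from html.parser import HTMLParser


class TagCounter(HTMLParser):
    """Counts opening tags seen in raw HTML text."""

    def __init__(self):
        super().__init__()
        self.counts = {}

    def handle_starttag(self, tag, attrs):
        # HTMLParser lower-cases tag names before calling this method.
        self.counts[tag] = self.counts.get(tag, 0) + 1


def count_tag(html_text, tag):
    """Return how many opening <tag> elements appear in html_text."""
    parser = TagCounter()
    parser.feed(html_text)
    return parser.counts.get(tag.lower(), 0)


# Hypothetical samples: a static page versus a page filled in later by a script.
static_page = "<div><table><tr><td>26-Week</td></tr></table></div>"
script_page = "<div id='results'></div><script>loadResults()</script>"

print(count_tag(static_page, "table"))
print(count_tag(script_page, "table"))
```

If the count is zero for the saved source but Inspect shows a table, the table is most likely being built by a script after the page loads.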
This was certainly an AWESOME challenge, and although I agree with all the other replies - they don't work. Don't get me wrong, I would have posted exactly the same response yesterday.
I followed all the normal steps for web scraping, but the data wasn't there. I thought about it for a while, did some googling and this link helped me https://www.thedataschool.co.uk/joe-carr/webscraping-through-alteryx-as-if-you-are-logged-in
I'd never used the Network tab before, but that unlocked the missing knowledge and I created the attached workflow.
You can see that I'm NOT downloading the HTML, I'm downloading the API call that is embedded in the HTML that provides the actual data.
I realised that the API they are using accepts a time input, so some of the values are wrong. I'll improve the workflow and post again later.
OK, here it is 🙂
Props to Joe Carr for this https://www.thedataschool.co.uk/joe-carr/webscraping-through-alteryx-as-if-you-are-logged-in it was the information that I needed to make the entire thing work.
My initial reaction was like everyone else here - web scraping a table ... EASY ..... uh, NOT so easy with this one. Load the URL in a browser, load the same URL in Alteryx and the data is not there. Change the header in Alteryx to make it look like it's a browser - STILL no data.
Then I read Joe's article. OK, I didn't read Joe's article, I saw the image about the Network tab in the Inspect tools - BINGO! ... Back to Chrome, do the same thing and there are 6 API-looking calls (so they probably match the 6 tabs on the webpage).
Easy - ok, write the workflow, call the 6 APIs, check the results and it's perfect. BUT leave it a few hours and it's not - the Alteryx output used to match the browser perfectly and now it doesn't. The only thing I did was re-run the workflow.
I looked at the API URLs again and saw some numbers on the end (1628744311727&_=1628744311728) .... hmmm ... 2 big numbers, ending in 27 and then 28 .... vague memories of epoch times !
So I found this https://www.epochconverter.com/ and you can see the epoch time NOW, and enter your own time to convert.
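If you'd rather check the number in code than on the converter site, this small Python sketch decodes an epoch-milliseconds value into a UTC timestamp. The helper name epoch_ms_to_utc is just for illustration; the sample value is the first of the two numbers from the URL above.

```python
from datetime import datetime, timedelta, timezone


def epoch_ms_to_utc(epoch_ms):
    """Convert epoch milliseconds to a timezone-aware UTC datetime."""
    # Split into whole seconds and leftover milliseconds to avoid float rounding.
    seconds, millis = divmod(epoch_ms, 1000)
    return datetime.fromtimestamp(seconds, tz=timezone.utc) + timedelta(milliseconds=millis)


decoded = epoch_ms_to_utc(1628744311727)
print(decoded.isoformat())  # 2021-08-12T04:58:31.727000+00:00
```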
OK, so the API is sending a to & from in epoch milliseconds !!! better still, in UTC time.
EASY !!!! .... get the current UTC time in epoch, convert to milliseconds (you might not have to convert, I didn't bother testing), modify the API URL for the times and the data matches - always !!
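The attached workflow does this step with Alteryx tools. As a rough equivalent, here is a Python sketch of the same idea: take the current UTC epoch time in milliseconds and swap it in for every 13-digit number in the URL. The function name refresh_epoch_params, the example.invalid URL and its parameter name t are all placeholders, not the real API address. Incrementing each later number by one is an assumption that simply mirrors the ...27 / ...28 pattern seen above.

```python
import itertools
import re
import time


def refresh_epoch_params(url, now_ms=None):
    """Replace each 13-digit epoch-ms value in url with a current one.

    The first match gets now_ms, the next gets now_ms + 1, and so on,
    which keeps the one-millisecond gap seen in the captured URL.
    """
    if now_ms is None:
        # Epoch time is UTC by definition; convert nanoseconds to milliseconds.
        now_ms = time.time_ns() // 1_000_000
    stamps = itertools.count(now_ms)
    # Match exactly 13 digits that are not part of a longer run of digits.
    return re.sub(r"(?<!\d)\d{13}(?!\d)", lambda match: str(next(stamps)), url)


# Placeholder URL for illustration only.
captured = "https://example.invalid/data?t=1628744311727&_=1628744311728"
print(refresh_epoch_params(captured))
```

Passing now_ms explicitly makes the result repeatable, which is handy when comparing against what the browser sent.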
The API actually gives you a huge number of fields (check the Select just after the Cross Tab) but I've only kept the fields that you need for each tab.
Hopefully it works for you, let me know if you have any questions or need help.