SOLVED

Web scraping

msve
8 - Asteroid

Hello,

 

There's a website, https://www.treasurydirect.gov/instit/annceresult/annceresult.htm, from which I would like to scrape a single row of data from the table. Specifically, I would like to get the data in the first 26-week row. I tried using the Download tool but could not figure out how to get to the table.

 

msve_0-1628647148926.png

 

danilang
19 - Altair

Hi @msve 

 

Check out this post from The Data School. It contains step-by-step instructions on how to download, find, and extract table data from an HTML page.

 

Dan

atcodedog05
22 - Nova

Hi @danilang 

 

I was not able to find a <table> component in the webpage. Any comments on what's happening and how we can approach it?

Maskell_Rascal
13 - Pulsar

Thanks for sharing this post @danilang! Definitely going to bookmark that one for future use. 

 

@atcodedog05 - the <table> components are there; they are just buried within a <div> tag.

 

[Screenshot: Maskell_Rascal_0-1628691255953.png]

 

Luke_C
17 - Castor

Hi @atcodedog05 

 

I see the <table> tags:

[Screenshot: Luke_C_0-1628691234203.png]

 

 

atcodedog05
22 - Nova

Hi @Maskell_Rascal and @Luke_C 

 

In Inspect you will definitely find <table>, because Inspect shows the page after the browser has run its scripts. You need to check View Page Source instead, because that is what the Download tool receives. You can also check the source code by saving the webpage and opening the HTML file in Notepad. 😅
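
To make the Inspect vs. View Page Source difference concrete, here is a minimal Python sketch (outside Alteryx) that counts <table> tags in raw HTML text. The two HTML strings are made-up stand-ins, not the real TreasuryDirect page: one has the table in the source, the other only has an empty <div> that a script would fill in later.

```python
from html.parser import HTMLParser


class TableCounter(HTMLParser):
    """Counts opening <table> tags in whatever HTML text it is fed."""

    def __init__(self):
        super().__init__()
        self.tables = 0

    def handle_starttag(self, tag, attrs):
        # html.parser lowercases tag names before calling this method
        if tag == "table":
            self.tables += 1


def count_tables(html: str) -> int:
    parser = TableCounter()
    parser.feed(html)
    parser.close()
    return parser.tables


# Illustrative stand-ins only (assumptions, not the real page source)
static_page = "<html><body><div><table><tr><td>1</td></tr></table></div></body></html>"
dynamic_page = "<html><body><div id='results'></div><script src='app.js'></script></body></html>"

print(count_tables(static_page))   # table is present in the raw source
print(count_tables(dynamic_page))  # table would only exist after scripts run
```

If the count is zero for the text a plain download returns, the table is probably being built by a script after the page loads, and parsing the HTML will not find it.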

cmcclellan
13 - Pulsar

This was certainly an AWESOME challenge, and although I agree with all the other replies - they don't work.  Don't get me wrong, I would have posted exactly the same response yesterday. 

 

I followed all the normal steps for web scraping, but the data wasn't there.  I thought about it for a while, did some googling, and this link helped me: https://www.thedataschool.co.uk/joe-carr/webscraping-through-alteryx-as-if-you-are-logged-in

 

I'd never used the Network tab before, but that unlocked the missing knowledge and I created the attached workflow.

 

You can see that I'm NOT downloading the HTML, I'm downloading the API call that is embedded in the HTML that provides the actual data.

cmcclellan
13 - Pulsar

I realised that the API they are using accepts a time input, so some of the values are wrong.  I'll improve the workflow and post again later.

atcodedog05
22 - Nova

Hi @cmcclellan 

 

Looking forward to hearing how you solved it 🙂

cmcclellan
13 - Pulsar

OK, here it is 🙂 

 

Props to Joe Carr for this: https://www.thedataschool.co.uk/joe-carr/webscraping-through-alteryx-as-if-you-are-logged-in. It was the information that I needed to make the entire thing work.

 

My initial reaction was like everyone else here - web scraping a table ... EASY ..... uh, NOT so easy with this one.  Load the URL in a browser, load the same URL in Alteryx and the data is not there.  Change the header in Alteryx to make it look like it's a browser - STILL no data.

 

Then I read Joe's article.  OK, I didn't read Joe's article, I just saw the image about the Network tab in the Inspect tools - BINGO! ...   Back to Chrome, do the same thing, and there are 6 API-looking calls (so they probably match the 6 tabs on the webpage).  

 

Easy - ok, write the workflow, call the 6 APIs, check the results and it's perfect.  BUT leave it a few hours and it's not: the Alteryx output used to match the browser exactly and now it doesn't.  The only thing I did was re-run the workflow.

 

I looked at the API URLs again and saw some numbers on the end (1628744311727&_=1628744311728) .... hmmm ... 2 big numbers, ending in 27 and then 28 .... vague memories of epoch times !
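
If you want to check the epoch hunch without a website, here is a small Python sketch (not part of the workflow) that decodes the two numbers above as milliseconds since 1970-01-01 UTC. Treating them as milliseconds is the assumption being tested here.

```python
from datetime import datetime, timedelta, timezone

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


def from_epoch_ms(ms: int) -> datetime:
    """Convert epoch milliseconds to a timezone-aware UTC datetime."""
    # timedelta arithmetic keeps the millisecond part exact (no float rounding)
    return EPOCH + timedelta(milliseconds=ms)


first = from_epoch_ms(1628744311727)
second = from_epoch_ms(1628744311728)

print(first.isoformat())   # 2021-08-12T04:58:31.727000+00:00
print(second - first)      # the two values are one millisecond apart
```

The decoded date lands in August 2021, when this thread was written, which is a good sign that the numbers really are epoch milliseconds.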

 

So I found this https://www.epochconverter.com/ and you can see the epoch time NOW, and enter your own time to convert.

 

OK, so the API is sending a to & from in epoch milliseconds!!!  Better still, in UTC time.

 

EASY!!!! ... get the current UTC time as an epoch value, convert it to milliseconds (you might not have to convert, I didn't bother testing), put the new times into the API URL, and the data matches - always!!

 

The API actually gives you a huge number of fields (check the Select just after the Cross Tab), but I've only kept the fields that you need for each tab.

 

Hopefully it works for you, let me know if you have any questions or need help.
