
Alteryx Designer Desktop Discussions

SOLVED

Web Scraping

JVelarde
6 - Meteoroid

Hello All,

 

I am trying to download data from a website, Inspector General Reports | Oversight.gov, more specifically the reports table. I have found many options, such as the Download tool or the Python tool, but my limited knowledge in this area brings me to this discussion. Can the community advise me on the best way to scrape the data?

 

  Thank you

 

v/r

John Velarde

VA OIG

9 REPLIES
DataNath
17 - Castor

I tend to use the Download tool and then parse the DownloadData field with RegEx, narrowing it down step by step until you have your headers and values. You can then cross-tab these into a functional table.
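For anyone trying this in the Python tool instead, here is a rough sketch of the same narrow-it-down idea using the standard `re` module. The HTML string is a made-up stand-in for the DownloadData field, and its column names and values are placeholders; the real page markup will differ, so the patterns would need adjusting.

```python
import re

# Hypothetical stand-in for the DownloadData field (placeholder values only).
html = """
<table>
  <tr><th>Date</th><th>Agency</th><th>Title</th></tr>
  <tr><td>2022-05-01</td><td>Agency X</td><td>Placeholder report 1</td></tr>
  <tr><td>2022-04-28</td><td>Agency Y</td><td>Placeholder <b>report</b> 2</td></tr>
</table>
"""

FLAGS = re.S | re.I  # let "." span newlines and ignore tag case


def strip_tags(fragment):
    """Remove nested tags and collapse whitespace inside a cell."""
    without_tags = re.sub(r"<[^>]+>", "", fragment)
    return re.sub(r"\s+", " ", without_tags).strip()


# Step 1: narrow the page down to individual table rows.
rows = re.findall(r"<tr\b[^>]*>(.*?)</tr>", html, flags=FLAGS)

# Step 2: the header cells of the first row become the column names.
headers = [strip_tags(c) for c in re.findall(r"<th\b[^>]*>(.*?)</th>", rows[0], flags=FLAGS)]

# Step 3: pair each data row's cells with the headers (the cross-tab step).
records = []
for row in rows[1:]:
    cells = [strip_tags(c) for c in re.findall(r"<td\b[^>]*>(.*?)</td>", row, flags=FLAGS)]
    records.append(dict(zip(headers, cells)))

for record in records:
    print(record)
```

Regular expressions are fragile against HTML that changes shape, so this is best treated as a starting point rather than a robust parser.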

IraWatt
17 - Castor

By the looks of it, most of the data you want is loaded statically. You can see this by looking at the page source (view-source:https://www.oversight.gov/reports). This means you can just use the Download tool to get the data. For instance, if we look at this graph here:

IraWatt_0-1651764718772.png

That data is held within this JS object, so you could use the JSON Parse tool to get it:

IraWatt_1-1651764742320.png
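If you would rather do this step in the Python tool, the idea can be sketched roughly as below. The variable name `chartData` and its fields are assumptions for illustration only; check the page source for the actual object name and structure. This also only works when the embedded object happens to be valid JSON.

```python
import json
import re

# Hypothetical page fragment; `chartData` and its fields are made up.
page_source = """
<script>
  var chartData = {"labels": ["2020", "2021"], "counts": [10, 12]};
</script>
"""

# Capture everything from the opening brace to the first "};" after it.
match = re.search(r"var\s+chartData\s*=\s*(\{.*?\});", page_source, flags=re.S)
if match is None:
    raise ValueError("chartData object not found in page source")

# Turn the captured text into a regular Python dictionary.
chart = json.loads(match.group(1))

# Pair each label with its value, ready to load into a table.
pairs = list(zip(chart["labels"], chart["counts"]))
print(pairs)
```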

If you need any help, make sure to ask :)

HTH,

Ira

JVelarde
6 - Meteoroid

Ahh yes, but my RegEx skills are beginner at best.

JVelarde
6 - Meteoroid

Hey Ira,

 

I see that! Maybe I should explain what I am trying to do. I want to do some text mining of the data to see whether other work has already been done in a specific audit area, so that we as OIGs don't duplicate work. This website is a good source for all that data, but I want to grab all the reports, download the data, and do some analysis. Does that make sense?

 

 Thank you everyone in advance.

 

V/R

John

IraWatt
17 - Castor

Hey @JVelarde,

Are you referring to this table with the documents in it? Grabbing details from tables is quite straightforward, no RegEx required.

IraWatt_0-1651767621818.png

Would you also want to grab the details within the links?
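As a rough illustration of reading a table without RegEx, here is a minimal sketch using Python's built-in `html.parser`, which also collects the link targets found inside the cells. The `TableParser` class is a hypothetical helper, and the sample table with its column names and values is a placeholder rather than the real page markup.

```python
from html.parser import HTMLParser


class TableParser(HTMLParser):
    """Collect the text of each table cell, plus any link targets, row by row."""

    def __init__(self):
        super().__init__()
        self.rows = []     # finished rows, each a list of cell strings
        self.links = []    # href values found inside cells
        self._row = None   # cells of the row currently being read
        self._cell = None  # text fragments of the cell currently being read

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []
        elif tag == "a" and self._cell is not None:
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            # Join the fragments and collapse runs of whitespace.
            self._row.append(" ".join("".join(self._cell).split()))
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None
            self._cell = None


# Placeholder markup standing in for the downloaded page.
sample = """
<table>
  <tr><th>Title</th><th>Agency</th></tr>
  <tr><td><a href="/report/example-1">Placeholder report 1</a></td><td>Agency X</td></tr>
</table>
"""

parser = TableParser()
parser.feed(sample)

# Treat the first row as column names and the rest as records.
header, *body = parser.rows
records = [dict(zip(header, row)) for row in body]
print(records)
print(parser.links)
```

The collected links could then be downloaded one by one to reach the detail pages behind each report.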

JVelarde
6 - Meteoroid

Hey Ira,

 

Yes! And then take that data set and do some text mining. If I could just get it into a table of some kind.

 

john

IraWatt
17 - Castor

Hey @DataNath,

Here's an example workflow to get some of the information from the table, hopefully enough to be going on:

IraWatt_0-1651769982022.png

One thing to note is that the table spans several pages. It seems you just need to increment the page number in the URL, e.g. https://www.oversight.gov/reports?page=1 -> https://www.oversight.gov/reports?page=2, etc. To achieve this in Alteryx, you will want to set up a batch or iterative macro to get every page. The community has some great videos on macros if you are unfamiliar with them.
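Outside of a macro, the same page-by-page loop can be sketched in Python along these lines. The helper names (`page_url`, `fetch`, `fetch_pages`) are hypothetical, and the number of the last page is an assumption you would need to supply yourself. The site may also expect particular request headers, which this sketch does not handle.

```python
from urllib.parse import urlencode
from urllib.request import urlopen

BASE_URL = "https://www.oversight.gov/reports"


def page_url(page):
    """Build the URL for one page of the reports table."""
    return f"{BASE_URL}?{urlencode({'page': page})}"


def fetch(url):
    """Download one page and return its HTML as text."""
    with urlopen(url, timeout=30) as response:
        return response.read().decode("utf-8", errors="replace")


def fetch_pages(first, last, fetcher=fetch):
    """Yield (page number, HTML) for each page in the inclusive range."""
    for page in range(first, last + 1):
        yield page, fetcher(page_url(page))


# Dry run with a stand-in fetcher, so nothing is actually downloaded.
for page, html in fetch_pages(1, 3, fetcher=lambda url: f"<html>{url}</html>"):
    print(page, html)
```

Leaving out the `fetcher` argument makes `fetch_pages` use the real download function. Passing the fetcher in keeps the loop easy to test without hitting the site.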

 

JVelarde
6 - Meteoroid

Hey Ira,

 

That worked, but yeah, I would need to batch it and pull the other pages...

 

Thank you! 

IraWatt
17 - Castor

No worries, glad I could help get you started :) 
