Alteryx Designer Desktop Discussions

JVelarde · ‎05-05-2022

Hello All,

I am trying to download data from a website, Inspector General Reports | Oversight.gov, and more specifically the reports table. I have found several options, such as the Download tool or the Python tool, but my limited knowledge in this area brings me to the discussion board. Can the community advise me on the best way to scrape the data?

Thank you

v/r

John Velarde

VA OIG

DataNath · ‎05-05-2022

I tend to use the Download tool and then parse the DownloadData field with RegEx, narrowing it down step by step until you have your headers and values, which you can then cross-tab into a functional table.
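For anyone who would rather try this in the Python tool, here is a minimal sketch of the same idea: pull out the rows with RegEx, split them into cells, and pair the header row with each value row. The sample HTML, the column names and the `parse_table` helper are all made up for illustration; the real page markup will differ, so the patterns would need adjusting.

```python
import re

# Hypothetical stand-in for the DownloadData field; not the real page markup.
SAMPLE_HTML = """
<table>
  <thead><tr><th>Report Date</th><th>Agency</th><th>Title</th></tr></thead>
  <tbody>
    <tr><td>05/03/2022</td><td>VA OIG</td><td><a href="/report/1">Example report A</a></td></tr>
    <tr><td>05/04/2022</td><td>DOJ OIG</td><td>Example report B</td></tr>
  </tbody>
</table>
"""

# \b keeps <tr> from matching longer tag names and <th> from matching <thead>.
ROW_RE = re.compile(r"<tr\b[^>]*>(.*?)</tr>", re.S | re.I)
CELL_RE = re.compile(r"<t[hd]\b[^>]*>(.*?)</t[hd]>", re.S | re.I)
TAG_RE = re.compile(r"<[^>]+>")


def parse_table(html):
    """Return one dict per data row, keyed by the header row's cell text."""
    rows = []
    for row_html in ROW_RE.findall(html):
        # Strip nested tags (links, spans) so only the visible text remains.
        cells = [TAG_RE.sub("", cell).strip() for cell in CELL_RE.findall(row_html)]
        if cells:
            rows.append(cells)
    if not rows:
        return []
    headers, *values = rows  # assumes the first row holds the headers
    return [dict(zip(headers, row)) for row in values]


records = parse_table(SAMPLE_HTML)
for record in records:
    print(record)
```

RegEx is fine for a quick narrowing-down like this, but it is brittle on nested or irregular HTML, so treat it as a starting point.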

IraWatt · ‎05-05-2022

By the looks of it, most of the data you want is loaded in statically. You can see this by looking at the page source (view-source:https://www.oversight.gov/reports), which means you can just use the Download tool to get the data. For instance, if we look at this graph here:

That data is held within a JS object in the page source, so you could use the JSON Parse tool to get it.
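If you go the Python tool route instead, a rough sketch of the same extraction might look like the following. The variable name `chartData`, the sample page and the `extract_js_object` helper are hypothetical, and this only works when the object literal in the page happens to be valid JSON (double-quoted keys, no trailing commas).

```python
import json
import re

# Hypothetical page fragment; the real variable name and contents will differ.
SAMPLE_PAGE = """
<script>
  var chartData = {"labels": ["2020", "2021"], "series": {"reports": [10, 12]}};
</script>
"""


def extract_js_object(page, var_name):
    """Find `var_name = {...}` in the page and decode the object as JSON."""
    match = re.search(r"\b" + re.escape(var_name) + r"\s*=\s*", page)
    if match is None:
        return None
    # raw_decode reads one JSON value starting at the given index and ignores
    # whatever follows it (the trailing semicolon, the rest of the script).
    obj, _end = json.JSONDecoder().raw_decode(page, match.end())
    return obj


data = extract_js_object(SAMPLE_PAGE, "chartData")
print(data["labels"], data["series"]["reports"])
```

Using `raw_decode` rather than a RegEx for the braces means nested objects are handled by the JSON parser itself.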

If you need any help, make sure to ask :)

HTH,

Ira

JVelarde · ‎05-05-2022

Ahh yes, but my RegEx skills are beginner at best.

JVelarde · ‎05-05-2022

Hey Ira,

I see that! Maybe I should explain what I am trying to do. I want to do some text mining of the data to see whether other work has already been done in a specific audit area, so that we as OIGs don't duplicate work. This website is a good source for all of that data, but I want to grab all the reports, download the data, and do some analysis. Does that make sense?

Thank you everyone in advance.

V/R

John

IraWatt · ‎05-05-2022

Hey @JVelarde,

Are you referring to this table with the documents in it? Grabbing details from tables is quite straightforward, no RegEx required.

Would you also want to grab the details within the links?

JVelarde · ‎05-05-2022

Hey Ira,

Yes! And then take that data set and do some text mining. If I could just get it into a table of some kind.

john

IraWatt · ‎05-05-2022

Hey @DataNath,

Here's an example workflow to get some of the information from the table, hopefully enough to be going on:

One thing to note is that there are several pages for this table. It seems you just need to increment the page number in the URL, e.g. https://www.oversight.gov/reports?page=1 -> https://www.oversight.gov/reports?page=2, etc. To achieve this in Alteryx you will want to set up a batch or iterative macro to get every page. The community has some great videos on macros if you are unfamiliar with them.
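For comparison, the same page-by-page loop could be sketched in the Python tool roughly as follows. This is only an outline under a few assumptions: the `page` query parameter behaves as described above, a page with no rows means the end of the table, and `parse_rows` is a hypothetical function you supply that turns one page of HTML into a list of rows. The starting page number is a parameter because it is not confirmed here whether the first page is 0 or 1.

```python
import time
from urllib.parse import urlencode
from urllib.request import urlopen

BASE_URL = "https://www.oversight.gov/reports"


def page_url(page):
    """Build the URL for one page of the reports table."""
    return f"{BASE_URL}?{urlencode({'page': page})}"


def fetch(url):
    """Download one page and return its HTML as text."""
    with urlopen(url, timeout=30) as response:
        return response.read().decode("utf-8", errors="replace")


def scrape_all(parse_rows, fetch_page=fetch, start=1, max_pages=500, delay=1.0):
    """Collect rows page by page until a page yields no rows.

    parse_rows: caller-supplied function, HTML text -> list of rows (hypothetical).
    max_pages:  safety cap so a site change cannot cause an endless loop.
    delay:      pause between requests, to go easy on the server.
    """
    all_rows = []
    for page in range(start, start + max_pages):
        rows = parse_rows(fetch_page(page_url(page)))
        if not rows:  # assumption: an empty page marks the end of the table
            break
        all_rows.extend(rows)
        time.sleep(delay)
    return all_rows
```

Calling `scrape_all(my_parser)` with your own parsing function would then return the rows from every page in one list, which is the same job the batch or iterative macro does inside Alteryx.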

JVelarde · ‎05-05-2022

Hey Ira,

That worked, but yeah, I would need to batch it and pull the other pages...

Thank you!

IraWatt · ‎05-05-2022

No worries, glad I could help get you started :)
