Hello All,
I am trying to download data from a website, Inspector General Reports | Oversight.gov, more specifically the reports table. I have found several options, such as the Download tool or the Python tool, but my limited knowledge in this area brings me to this discussion. Can the community advise me on the best way to scrape the data?
Thank you
v/r
John Velarde
VA OIG
I tend to use the Download tool and then parse the DownloadData field with RegEx, narrowing it down step by step until I have the headers and values, which can then be cross-tabbed into a functional table.
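For anyone taking the Python tool route instead, here is a minimal sketch of the same download-then-RegEx idea. The sample markup, column names, and the `parse_table` helper are hypothetical stand-ins for whatever DownloadData actually contains; the real page structure may differ and the patterns would need adjusting.

```python
import re

# Hypothetical stand-in for the downloaded page body (DownloadData);
# the real markup and column names may differ.
SAMPLE_HTML = """
<table>
  <thead><tr><th>Report Date</th><th>Agency</th><th>Title</th></tr></thead>
  <tbody>
    <tr><td>01/02/2023</td><td>Agency A</td><td><a href="/report/1">Example Report A</a></td></tr>
    <tr><td>02/03/2023</td><td>Agency B</td><td><a href="/report/2">Example Report B</a></td></tr>
  </tbody>
</table>
"""

# Step 1: narrow the text down to rows, then to cells within each row.
ROW_RE = re.compile(r"<tr\b[^>]*>(.*?)</tr>", re.S | re.I)
CELL_RE = re.compile(r"<t[hd]\b[^>]*>(.*?)</t[hd]>", re.S | re.I)
TAG_RE = re.compile(r"<[^>]+>")  # strips any tags left inside a cell


def parse_table(html):
    """Return one dict per data row, keyed by the header row's cell text."""
    rows = []
    for row_html in ROW_RE.findall(html):
        cells = [TAG_RE.sub("", cell).strip() for cell in CELL_RE.findall(row_html)]
        if cells:
            rows.append(cells)
    if not rows:
        return []
    # Step 2: the first row is treated as headers (an assumption), the rest as values.
    headers, values = rows[0], rows[1:]
    return [dict(zip(headers, row)) for row in values]


records = parse_table(SAMPLE_HTML)
```

Each dict in `records` is one table row, which plays the same role as the cross-tabbed output in a workflow. RegEx on HTML is fragile, so this is only reasonable when the markup is regular.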
By the looks of it, most of the data you want is loaded statically; you can see this by looking at the page source (view-source:https://www.oversight.gov/reports). This means you can just use the Download tool to get the data. For instance, if we look at this graph here:
That data is held within this JS object, so you could use the JSON parse tool to get it:
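As a rough Python equivalent of pulling a JS object out of the page source, the sketch below locates an assignment and decodes the JSON that follows it. The variable name `chartData`, its fields, and the `extract_js_object` helper are assumptions for illustration, not the names used on the real page, and this only works when the object literal is valid JSON (double-quoted keys, no trailing commas).

```python
import json
import re

# Hypothetical page fragment; the variable name and fields are assumptions.
SAMPLE_HTML = """
<script>
  var chartData = {"labels": ["Agency A", "Agency B"], "counts": [3, 5]};
  drawChart(chartData);
</script>
"""


def extract_js_object(html, var_name):
    """Find `var_name = {...}` in the page source and decode the JSON value."""
    match = re.search(r"\b" + re.escape(var_name) + r"\s*=\s*", html)
    if match is None:
        raise ValueError(f"{var_name!r} not found in page source")
    # raw_decode parses one JSON value starting at the given index and
    # ignores whatever follows it (here, the trailing semicolon and script).
    obj, _end = json.JSONDecoder().raw_decode(html, match.end())
    return obj


data = extract_js_object(SAMPLE_HTML, "chartData")
```

Using `raw_decode` avoids having to write a RegEx that matches the closing brace of a nested object.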
If you need any help, make sure to ask :)
HTH,
Ira
Aww yes, but my RegEx skills are beginner level at best.
Hey Ira,
I see that! Maybe I should explain what I am trying to do. I want to do some text mining of the data to see if other work has been done in a specific audit area, so that we as OIGs don't duplicate work. This website is a good source for all that data, but I want to grab all the reports, download the data, and do some analysis. Does that make sense?
Thank you everyone in advance.
V/R
John
Hey @JVelarde,
Are you referring to this table with the documents in it? Grabbing details from tables is quite straightforward, no RegEx required.
Would you also want to grab the details within the links?
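To illustrate the no-RegEx route outside of Alteryx, here is a minimal sketch using Python's standard `html.parser` that collects both the cell text and any link targets per row. The sample markup and the `TableParser` class are hypothetical; the real reports table may be structured differently.

```python
from html.parser import HTMLParser

# Hypothetical table markup; the real page may differ.
SAMPLE_HTML = """
<table>
  <tr><th>Date</th><th>Title</th></tr>
  <tr><td>01/02/2023</td><td><a href="/report/1">Example Report A</a></td></tr>
  <tr><td>02/03/2023</td><td><a href="/report/2">Example Report B</a></td></tr>
</table>
"""


class TableParser(HTMLParser):
    """Collects cell text and link hrefs for every table row it sees."""

    def __init__(self):
        super().__init__()
        self.rows = []    # one list of cell texts per row
        self.links = []   # one list of hrefs per row, aligned with self.rows
        self._row = None
        self._row_links = None
        self._cell = None

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row, self._row_links = [], []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []
        elif tag == "a" and self._cell is not None:
            href = dict(attrs).get("href")
            if href:
                self._row_links.append(href)

    def handle_data(self, data):
        # Only keep text that appears inside a table cell.
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self.links.append(self._row_links)
            self._row = None


parser = TableParser()
parser.feed(SAMPLE_HTML)
header, *body = parser.rows
```

The hrefs in `parser.links` are relative paths in this sample, so following them would mean joining each one to the site's base URL first.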
Hey Ira,
Yes! And then take that data set and do some text mining. If I could just get it into a table of some kind.
John
Hey @DataNath,
Here's an example workflow to get some of the information from the table, hopefully enough to be going on:
One thing to note is that there are several pages for this table. It seems you just need to increment the page number in the URL, e.g. https://www.oversight.gov/reports?page=1 -> https://www.oversight.gov/reports?page=2, etc. To achieve this in Alteryx, you will want to set up a batch or iterative macro to get every page. The community has some great videos on macros if you are unfamiliar with them.
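For comparison, the page-by-page loop that a batch or iterative macro performs could be sketched in Python roughly as below. The function names, the starting page number, the stop-on-empty-page rule, and the one-second delay are all assumptions; only the `?page=N` URL pattern comes from the site.

```python
import time
from urllib.parse import urlencode
from urllib.request import Request, urlopen

BASE_URL = "https://www.oversight.gov/reports"


def page_url(page):
    """Build the URL for one page of the reports table."""
    return f"{BASE_URL}?{urlencode({'page': page})}"


def fetch(url):
    """Download one page and return its body as text."""
    req = Request(url, headers={"User-Agent": "Mozilla/5.0 (reports sketch)"})
    with urlopen(req, timeout=30) as resp:
        return resp.read().decode("utf-8", errors="replace")


def scrape_all(parse_page, fetch_page=fetch, first_page=1, max_pages=1000, delay=1.0):
    """Fetch consecutive pages until one yields no rows (assumed end marker).

    parse_page is any function that turns page HTML into a list of rows.
    first_page is an assumption; some sites number their first page 0.
    """
    records = []
    for page in range(first_page, first_page + max_pages):
        rows = parse_page(fetch_page(page_url(page)))
        if not rows:
            break
        records.extend(rows)
        time.sleep(delay)  # pause between requests to be polite to the server
    return records
```

Passing the fetcher and parser in as arguments keeps the loop testable without hitting the site, and `max_pages` is a safety cap in case the empty-page stop condition never triggers.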
Hey Ira,
That worked, but yeah, I would need to batch and pull the other pages...
Thank you!
No worries, glad I could help get you started :)