About two months ago, I did a little project on the Alteryx Community. The purpose was to increase the proportion of closed cases and, even more importantly, to decrease the number of cases with 0 responses. In an email, Andy Cooper (Alteryx Solutions Engineer) shared that in the first published version there were 52 topics with 0 responses. He was quite satisfied that they had managed to get it down to 30 in 14 days. Now we are down to 8 topics with 0 responses, which I think is a natural level, since there will always be new unanswered cases.
Here is the workflow as it looks today (one macro that downloads the data, and one that parses the downloaded data and outputs it to Tableau and to a .yxdb for other purposes):
So how did I do it?
Well, I will try to share my story and hopefully get you started with web scraping (there are MANY ways of doing it; this one is back to basics). I will divide the post into three parts:
- Finding the structure (look at the page and figure out what you want to do)
- Downloading the data (Download all the pages you want)
- Parsing the data (Parse the data in your preferred manner)
1. Finding the Structure
The first step for me when I do web scraping is to find out which elements I want and how I change pages. Let's take the Alteryx Community as an example:
1. Five Different Categories
2. Six Different Elements:
   1. Solved or not?
   2. Title
   3. Last post
   4. Replies
   5. Start
   6. Views
3. How do I change the page?
This is often an easy question to answer. Load the first page, go to page 2, and look for the difference. From the two links below, my best guess is that "/page/2" defines page 2 and that "/page/3" would then give me page 3.
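If you prefer to see that pattern as code, here is a minimal sketch of how the page URLs could be built (the base URL below is purely illustrative, not the exact Community address):

```python
# Purely illustrative base URL - substitute the forum board you are scraping.
BASE_URL = "https://community.alteryx.com/t5/Example-Board/bd-p/example-board"

def page_url(page_number):
    """Build the URL for a given page, following the '/page/N' pattern."""
    if page_number == 1:
        return BASE_URL  # the first page usually has no suffix
    return "{}/page/{}".format(BASE_URL, page_number)

print(page_url(2))  # .../page/2
print(page_url(3))  # .../page/3
```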
Now that we have analysed the structure of the site, I will move on to the download task.
2. Downloading the Data
The purpose of this section is to describe the process of downloading the data, or in other words, downloading one page at a time. I accomplished this with an iterative macro that continues until the page returns something other than a "200 OK" in the download header. This criterion can be set differently depending on the site you are scraping (try to call a page that does not exist and see what the site returns).
Here is a sample of the workflow:
The flow is divided into two parts. The first part downloads a page and the second part evaluates whether the page exists. If the page exists, the record will be output both to the macro output and to the iteration output, which tells the macro to run once again. If the page does not exist, there will be no more records in the iteration output and the macro will stop.
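The macro itself is a visual Alteryx workflow, but if it helps to see the same logic in code, here is a rough Python sketch of the loop (the URL pattern and the plain status-code check are assumptions based on the description above; your stopping criterion may need to differ):

```python
import requests

def download_all_pages(base_url):
    """Request '/page/N' pages one at a time and stop as soon as the site
    returns something other than 200 OK - the same idea as the iterative macro."""
    pages = []
    page_number = 1
    while True:
        url = base_url if page_number == 1 else "{}/page/{}".format(base_url, page_number)
        response = requests.get(url)
        if response.status_code != 200:  # anything other than "200 OK" ends the loop
            break
        pages.append(response.text)
        page_number += 1
    return pages
```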
3. Parsing the Data
The last part is to parse the data, which can be done in different ways. I needed to train my RegEx skills, so I did it using RegEx (the XML parse tool is often also good for HTML).
The flow looks like this:
If you do not know HTML, this might be a bit tricky; however, there are good tools to get you going. I prefer to use the developer console in Google Chrome.
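To make the RegEx idea concrete, here is a small Python sketch. The markup and class names below are placeholders I made up for illustration; inspect the real page in the developer console and adjust the patterns to what you actually see:

```python
import re

# Made-up markup standing in for one row of the topic list.
sample_html = '''
<div class="message-subject"><a href="/t5/thread-123">How do I join two files?</a></div>
<span class="replies">4</span>
<span class="views">152</span>
'''

titles  = re.findall(r'class="message-subject"><a href="[^"]*">([^<]+)</a>', sample_html)
replies = re.findall(r'class="replies">(\d+)</span>', sample_html)
views   = re.findall(r'class="views">(\d+)</span>', sample_html)

print(titles)   # ['How do I join two files?']
print(replies)  # ['4']
print(views)    # ['152']
```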
Please take into account that you will do one request for each page and therefore put some traffic on the website. What I usually do is save the output and then do the parsing from a local file. I hope this can inspire some of you to get started with web scraping. If you have any questions, please ping me, or come to San Diego and enjoy Inspire 2016 - I will definitely be there! You can register here.
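If you want a concrete picture of that local caching step, here is one way it might look in Python (the file and folder names are just examples):

```python
from pathlib import Path

def cache_pages(pages, folder="community_pages"):
    """Write each downloaded page to disk once, so later parsing runs
    read local files instead of hitting the website again."""
    Path(folder).mkdir(exist_ok=True)
    for i, html in enumerate(pages, start=1):
        Path(folder, "page_{:03d}.html".format(i)).write_text(html, encoding="utf-8")

def load_cached_pages(folder="community_pages"):
    """Read the cached pages back in the order they were downloaded."""
    return [p.read_text(encoding="utf-8") for p in sorted(Path(folder).glob("page_*.html"))]
```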
Have fun scraping!