
Engine Works

Under the hood of Alteryx: tips, tricks and how-tos.
danielbrun2
ACE Emeritus

About two months ago, I did a little project on the Alteryx Community. The purpose was to increase the proportion of closed cases and, even more importantly, decrease the number of cases with 0 responses. In an email, Andy Cooper (Alteryx Solutions Engineer) shared that in the first published version there were 52 topics with 0 responses. He was quite satisfied that they managed to get it down to 30 within 14 days. Now we are down to 8 topics with 0 responses, which I think is a natural level, since there will always be new unanswered cases.

 

Here is the workflow as it looks today (one macro that downloads the data, and one that parses the downloaded data and outputs it to Tableau and to a .yxdb file for other purposes):

db_1.png

 

So how did I do it?

 

Well, I will try to share my story and hopefully get you started with web scraping (there are MANY ways of doing it – this one is back to basics). I will divide the post into three parts:

 

  1. Finding the structure (look at the page and figure out what you want to do)
  2. Downloading the data (Download all the pages you want)
  3. Parsing the data (Parse the data in your preferred manner)

 

1. Finding the structure

 

The first step for me when I do web scraping is to find out what elements I want and how to change pages. Let's take the Alteryx Community as an example:

 

1. Five Different Categories

db_2.png

2. Six Different Elements

db_3.png

1. Solved or not?

2. Title

3. Last post

4. Replies

5. Start

6. Views
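 

In code terms, those six elements become one record per topic. Here is a minimal Python sketch of that target record – the field names and my reading of "Start" as the topic starter are my own, not anything defined by the community pages or the workflow:

from dataclasses import dataclass

@dataclass
class Topic:
    solved: bool      # 1. Solved or not?
    title: str        # 2. Title
    last_post: str    # 3. Last post
    replies: int      # 4. Replies
    started_by: str   # 5. Start (topic starter, my interpretation)
    views: int        # 6. Views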

 

3. How do I change the page?

This is often an easy question to answer. Load the first page, then go to page 2, and look for the difference. From the two links below, my best guess is that "/page/2" defines page 2 and that "/page/3" would then give me page 3.
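 

Written out in a couple of lines of Python, the pattern looks like this (the base address is purely illustrative, not the exact board URL):

base = "https://community.alteryx.com/board"   # illustrative only
page_2 = base + "/page/2"
page_3 = base + "/page/3"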

 

Now that we have analysed the structure of the site, I will move on to the download task.

 

2. Downloading the Data

 

The purpose of this section is to describe the process of downloading the data, or in other words, downloading one page at a time. I accomplished this with an iterative macro that continues until the page returns something other than a "200 OK" in the download header. This criterion can be set differently depending on the site you are scraping (try to call a page that does not exist and see what it returns).

 

Here is a sample of the workflow:

db_4.png

 

The flow is divided into two parts. The first part downloads a page, and the second part evaluates whether the page exists. If the page exists, the record is output both to the macro output and to the iteration output, which tells the macro to run once again. If the page does not exist, there are no more records in the iteration output and the macro stops.
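 

Outside of Alteryx, the same stop-on-a-non-200 idea can be sketched in a few lines of Python. This only illustrates the logic, not what the macro does internally; the board URL is made up and the requests library is assumed to be installed:

import requests

BASE = "https://community.alteryx.com/board"  # illustrative, not the exact board address

def download_all_pages(max_pages: int = 1000) -> list[str]:
    """Fetch page 1, 2, 3, ... and stop as soon as the server no longer answers 200 OK."""
    pages = []
    for n in range(1, max_pages + 1):
        url = BASE if n == 1 else f"{BASE}/page/{n}"
        resp = requests.get(url, timeout=30)
        if resp.status_code != 200:   # same stop condition the macro checks in the download header
            break
        pages.append(resp.text)
    return pages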

 

3. Parsing the Data

 

The last part is to parse the data, which can be done in different ways. I needed to train my RegEx skills, so I did it using RegEx (the XML Parse tool is often also a good option for HTML).

 

The flow looks like this:

 

db_5.png

If you do not know HTML, this might be a bit tricky; however, there are good tools to get you going. I prefer to use the developer console in Google Chrome.
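 

To give a flavour of the RegEx approach, here is a small Python sketch. The markup and class names are invented for illustration – use the developer console to find the real patterns on the page you scrape:

import re

# Invented sample markup - not the community's real HTML.
sample = """
<div class="topic solved">
  <a class="topic-title" href="/t5/example">How do I parse JSON?</a>
  <span class="replies">4</span>
  <span class="views">123</span>
</div>
<div class="topic">
  <a class="topic-title" href="/t5/example2">Join tool question</a>
  <span class="replies">0</span>
  <span class="views">17</span>
</div>
"""

pattern = re.compile(
    r'<div class="topic(?P<solved> solved)?">.*?'
    r'<a class="topic-title"[^>]*>(?P<title>[^<]+)</a>.*?'
    r'<span class="replies">(?P<replies>\d+)</span>.*?'
    r'<span class="views">(?P<views>\d+)</span>',
    re.DOTALL,
)

for m in pattern.finditer(sample):
    print(m.group("title"), "solved" if m.group("solved") else "open",
          m.group("replies"), "replies,", m.group("views"), "views")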

 

Please take into account that you will make one request for each page and therefore put some traffic on the website. What I usually do is save the output and then do the parsing from a local file. I hope this can inspire some of you to get started with web scraping. If you have any questions, please ping me, or come to San Diego and enjoy Inspire 2016 – I will definitely be there! You can register here.
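 

For completeness, a small Python sketch of that download-once, parse-locally habit (the folder and file names are my own choice):

import pathlib

CACHE = pathlib.Path("community_pages")
CACHE.mkdir(exist_ok=True)

def save_pages(pages: list[str]) -> None:
    # Write each downloaded page to disk once, so later parsing runs hit no web traffic.
    for n, html in enumerate(pages, start=1):
        (CACHE / f"page_{n:04d}.html").write_text(html, encoding="utf-8")

def load_pages() -> list[str]:
    # Re-read the saved pages for parsing; zero-padded names keep them in order.
    return [p.read_text(encoding="utf-8") for p in sorted(CACHE.glob("page_*.html"))]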

 

Have fun scraping!

 
