Alteryx Designer Desktop Discussions

SOLVED

Webscraping Gurus needed

danespoors
8 - Asteroid

Good afternoon from the snow covered UK,

 

I am extremely stuck and in need of assistance. I work for a university and part of my role is to download and analyse what we call "League Tables" which rank universities on a number of criteria. One of the organisations that creates these league tables is the Complete University Guide.

 

Here's my issue: I want to download the league tables from this website using Alteryx, but it isn't playing ball. I had this set up last year, but they have since changed the website design and my workflow is now completely useless. The site uses a drop-down element to choose which tables to download, and I have no idea how to set up a web scraper to manipulate it. Alternatively, I tried scraping the table elements from the page itself instead of downloading the CSV files, but that didn't work either.

 

The website is: https://www.thecompleteuniversityguide.co.uk/league-tables/rankings?tabletype=full-table and this is the main page with the summary table for all the universities.

 

In addition, there are individual subject tables that can be accessed from the main page, e.g.: https://www.thecompleteuniversityguide.co.uk/league-tables/rankings/accounting-and-finance?tabletype...

 

I require the full league table and all of the subject tables. Last year I had 3 months in order to build this webscraper and it worked fantastically. This time, I have 1 month and the task seems more complicated due to the new design of the website. If anyone knows how to do this or how to even start this, I would be very grateful indeed.

 

Many thanks,

 

Dane.

 

 

OllieClarke
15 - Aurora

Hi @danespoors 

I had a quick go at the first table to get you going, and have attached my workflow below. I went for a RegEx-heavy approach based on looking at the HTML of the website (found by right-clicking the web page and clicking Inspect).

I'm not sure how dynamic this workflow is, but if this is a site that you only need to scrape once, then it should get you somewhere at least. 
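For anyone who wants to see the RegEx-heavy idea outside Alteryx, here is a minimal Python sketch. It is not the attached workflow, and the `SAMPLE_HTML` markup is made up purely for illustration; the real page's tags and class names will differ, so the patterns would need adjusting after inspecting the actual HTML.

```python
import re

# Hypothetical markup for illustration only; NOT copied from the real site.
SAMPLE_HTML = """
<table>
  <tr><td class="rank">1</td><td class="name">Example University A</td></tr>
  <tr><td class="rank">2</td><td class="name">Example University B</td></tr>
</table>
"""

# One pattern isolates each table row, a second pulls the cells out of a row.
ROW_PATTERN = re.compile(r"<tr>(.*?)</tr>", re.DOTALL)
CELL_PATTERN = re.compile(r"<td[^>]*>(.*?)</td>", re.DOTALL)


def parse_rows(html):
    """Return a list of rows, each row being a list of cell strings."""
    rows = []
    for row_html in ROW_PATTERN.findall(html):
        cells = [cell.strip() for cell in CELL_PATTERN.findall(row_html)]
        if cells:  # skip rows that contain no <td> cells
            rows.append(cells)
    return rows


if __name__ == "__main__":
    for row in parse_rows(SAMPLE_HTML):
        print(row)
```

Like the workflow, patterns of this kind are tied to the current page layout and are likely to break if the site is redesigned again.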

 

OllieClarke_0-1612985244656.png

Output:

OllieClarke_1-1612985397411.png


To get the other subjects, inspect the drop-down: it contains a list of all of them, which you can copy out and clean up (replacing '&' with 'and' and spaces with '-'). You can then append each cleaned name to the URL stem to get your new URLs.
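The clean-up described above can be sketched in a few lines of Python. This is only an illustration under some assumptions: the function names are hypothetical, the lowercasing step is inferred from the `accounting-and-finance` example URL in the original post, and the query string is left off because it is truncated in the thread.

```python
import re

# URL stem taken from the links in the original post.
URL_STEM = "https://www.thecompleteuniversityguide.co.uk/league-tables/rankings"


def subject_to_slug(subject):
    """Turn a drop-down label such as 'Accounting & Finance' into a URL slug."""
    slug = subject.strip().lower()    # assumption: slugs are lowercase
    slug = slug.replace("&", "and")   # replace '&' with 'and'
    slug = re.sub(r"\s+", "-", slug)  # replace runs of spaces with '-'
    return slug


def build_subject_urls(subjects):
    """Append each cleaned subject name to the URL stem."""
    return [f"{URL_STEM}/{subject_to_slug(s)}" for s in subjects]


if __name__ == "__main__":
    # Example labels; the real list would be copied from the drop-down.
    for url in build_subject_urls(["Accounting & Finance", "Law"]):
        print(url)
```

Any subject whose label contains other punctuation (commas, apostrophes, brackets) may need an extra replacement rule, so it is worth checking a few generated URLs in a browser.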

 

Hope that helps, or at least points you in the right direction.

 

Best,

 

Ollie

danespoors
8 - Asteroid

You sir, are a genius! This is marvellous and it will most definitely do the job!

 

Thank you for your quick solution to this, it will help me a LOT.

 

Dane.

danespoors
8 - Asteroid

Oh wise guru,

 

I have been looking at your workflow and playing with it so that I can wrap my head around what you have done (I'm a RegEx novice, so I'm trying to pick it up), and I was wondering: is there a way to plug the subject URLs that you've made into the workflow and have them download nicely?

 

I took a stab at it and got a little lost when I was trying to rejig the middle section.

 

It is absolutely fine if it's not or if you are done with this project, I just thought I'd ask 🙂

 

Again, thank you for your help 🙂

 

Dane.
