Alteryx Designer Desktop Discussions

Scraping university websites and, if possible, descriptions

MZ900605
8 - Asteroid

Hello, I need help scraping university data from this website, which is really complicated for me: https://whed.net/home.php

I need the names and some of their fields of study, or at least their websites, but these only show up if I click on the map or select a description. Is there any way this can be done efficiently, please?

Thanks in advance.

IraWatt
17 - Castor

Hey @MZ900605,

Is it this information you're interested in?

IraWatt_0-1656413973559.png

or this:

IraWatt_1-1656413999729.png

Can you take a screenshot and highlight exactly which information you're interested in?

Thanks,

Ira

smoskowitz
12 - Quasar

Hi @MZ900605 --

 

I took a quick look at the site and I think it's doable. Personally, I think it would be somewhat easier using Python and BeautifulSoup than Alteryx, but the Python script can be put into the Python tool for any downstream processing.

 

I don't have the time to try to code this out (and my coding skills are weak). The biggest challenge I see is figuring out the URL for each country or state page. Once you crack that, you can:

 

  • Write each country URL to a list.
  • Loop through each country, get all school URLs, and write all of those to a list.
  • Then loop through each school URL, collect the relevant data, and maybe write that to a pandas DataFrame.
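The three steps above could be sketched roughly like this. This is only a sketch under assumptions, not tested against whed.net: the `country_marker` and `school_marker` arguments are hypothetical placeholders for whatever text turns out to identify country and school links on the real pages, and the standard library's `html.parser` is used here in place of BeautifulSoup just to keep the example dependency-free (BeautifulSoup's `find_all("a")` would do the same job).

```python
from html.parser import HTMLParser
from urllib.parse import urljoin
from urllib.request import urlopen


class LinkCollector(HTMLParser):
    """Collect the href of every <a> tag on a page."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)


def extract_links(html, base_url, must_contain):
    """Return absolute, de-duplicated links whose URL contains `must_contain`."""
    parser = LinkCollector()
    parser.feed(html)
    seen, result = set(), []
    for href in parser.links:
        url = urljoin(base_url, href)
        if must_contain in url and url not in seen:
            seen.add(url)
            result.append(url)
    return result


def fetch(url):
    """Download a page as text (this makes a real network call)."""
    with urlopen(url, timeout=30) as response:
        return response.read().decode("utf-8", errors="replace")


def crawl(start_url, country_marker, school_marker, fetch_page=fetch):
    """Country URLs -> school URLs -> one record per school page."""
    # Step 1: write each country URL to a list.
    country_urls = extract_links(fetch_page(start_url), start_url, country_marker)

    # Step 2: loop through each country and collect all school URLs.
    school_urls = []
    for country_url in country_urls:
        page = fetch_page(country_url)
        school_urls += extract_links(page, country_url, school_marker)

    # Step 3: visit each school URL. Only the raw HTML is kept here, because
    # pulling out names and websites depends on the site's actual markup.
    return [{"url": url, "html": fetch_page(url)} for url in school_urls]
```

The list of dictionaries returned by `crawl` can be handed to `pandas.DataFrame(rows)` inside the Python tool for downstream processing.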

Hopefully that provides some guidance.

 

Thanks,

Seth

MZ900605
8 - Asteroid

The actual names and websites, if possible.

MZ900605
8 - Asteroid

I would love a Python solution as well.

MZ900605
8 - Asteroid

@IraWatt I hope this helps. Thanks.
@smoskowitz That's really interesting. I will give it a try if Alteryx doesn't work out.

IraWatt
17 - Castor

Hey @MZ900605,

Just looking at your Canada example: when you click Canada on the map, you get this:

IraWatt_2-1656415862047.png

This is the request which generates the page:

 

IraWatt_0-1656415829754.png

 

IraWatt_3-1656415912622.png

(This is the page source of that page, "Search Results – WHED – IAU's World Higher Education Database"):

IraWatt_0-1656415397052.png

The link to each box is stored on the page here. For example, the popup for "Canada - Northwest Territories" has the address https://whed.net/detail_system.php?JTo2MF0tIzRgCmAK, which you can request from Alteryx with the Download tool.

 

I think you would need to replicate these requests in Alteryx or Python to download each country's information.
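On the Python side, a minimal sketch of that idea might look like the following. It assumes the page source has already been saved as text, and that the `detail_system.php` links appear in it in the same form as the example address above; the character set allowed after the `?` is a guess based on that one example, and the `User-Agent` header is likewise an assumption rather than something the site is known to require.

```python
import re
from urllib.parse import urljoin
from urllib.request import Request, urlopen

BASE_URL = "https://whed.net/"

# Assumption: the token after "?" only uses base64-like characters,
# as in the single example address quoted in this thread.
DETAIL_PATTERN = re.compile(r"detail_system\.php\?[A-Za-z0-9+/=_%-]+")


def find_detail_urls(page_source, base_url=BASE_URL):
    """Return each distinct detail_system.php link found in the page source."""
    urls = []
    for match in DETAIL_PATTERN.findall(page_source):
        url = urljoin(base_url, match)
        if url not in urls:
            urls.append(url)
    return urls


def download(url):
    """Request one detail page, much like the Download tool would."""
    request = Request(url, headers={"User-Agent": "Mozilla/5.0"})
    with urlopen(request, timeout=30) as response:
        return response.read().decode("utf-8", errors="replace")
```

Calling `find_detail_urls` on the saved source and then `download` on each result would fetch every popup without copying links by hand.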

 

MZ900605
8 - Asteroid

@IraWatt So what I need to do is copy each link and connect it to the Download tool?

Screenshot (59).png

IraWatt
17 - Castor

@MZ900605 Here is an example workflow to get this page:

IraWatt_3-1656416226707.png

IraWatt_4-1656416255881.png

I also updated my initial look at the problem above.


MZ900605
8 - Asteroid

@IraWatt this will take ages :) 

I hope to find a way to collect university names and the courses they offer, but this was the only website I found :(
