Alteryx Designer Desktop Discussions


Web scraping Wikipedia

MZ900605
8 - Asteroid

Hello guys, I need help collecting worldwide university names into a single table. I tried some websites, but their data turned out to be outdated or inaccurate.

https://en.wikipedia.org/wiki/Lists_of_universities_and_colleges_by_country

Any help or suggestions to make this task quicker or easier would be great, thanks.

The output should look like this:

 

university name         country
University of London    London UK
Stockholm University    Stockholm Sweden
...



5 REPLIES
HomesickSurfer
12 - Quasar

Hi @MZ900605 

 

This will get you started...a scrape of universities in Canada from: https://en.wikipedia.org/wiki/List_of_universities_in_Canada

Workflow attached.

You will need to do the same for additional countries: feed in a list of countries, replace 'Canada' in the URL with each country name, and process each one in a batch macro...
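Outside Alteryx, the "swap the country name into the URL" step could be sketched in Python roughly as follows. This is only an illustration: the URL template is taken from the Canada link above, `country_list_url` is a hypothetical helper, and the other country names are examples whose pages may well use a different title on Wikipedia, so each generated URL should be checked.

```python
from urllib.parse import quote

# Template assumed from the Canada URL in this reply; other countries
# may not follow this naming pattern on Wikipedia.
URL_TEMPLATE = "https://en.wikipedia.org/wiki/List_of_universities_in_{country}"


def country_list_url(country):
    """Build a candidate list-page URL for one country (hypothetical helper)."""
    # Wikipedia page URLs use underscores in place of spaces.
    slug = country.strip().replace(" ", "_")
    return URL_TEMPLATE.format(country=quote(slug))


# Example input list; in the workflow this would be the control input
# that drives the batch macro.
countries = ["Canada", "Sweden", "New Zealand"]
urls = {c: country_list_url(c) for c in countries}

for country, url in urls.items():
    print(country, "->", url)
```

Each URL would then be downloaded and parsed on its own, which is the part the batch macro handles in the attached workflow.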

 

DOWNLOAD EXAMPLE.PNG

clmc9601
13 - Pulsar

Hi @MZ900605,

 

I mocked up a workflow that parses the URLs from the first page and then reads each linked page individually. No batch macro is needed. The resulting data will definitely need some cleaning, but here's the start!
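For readers without the workflow, the "parse the URLs from the first page" idea might look something like this in Python. It is a sketch, not the attached workflow: the class and function names are made up, the sample HTML is hand-written rather than copied from Wikipedia, and the `/wiki/List_of_` prefix filter is an assumption about which links are per-country lists.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin

# Index page from the original question.
BASE = "https://en.wikipedia.org/wiki/Lists_of_universities_and_colleges_by_country"


class ListLinkCollector(HTMLParser):
    """Collect unique links whose path looks like a 'List of ...' article."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        href = dict(attrs).get("href") or ""
        # Assumed filter: keep article links starting with /wiki/List_of_,
        # which skips footnote anchors (#cite_note-...) and Help:/File: pages.
        if href.startswith("/wiki/List_of_") and ":" not in href:
            url = urljoin(BASE, href)
            if url not in self.links:
                self.links.append(url)


def extract_list_links(html):
    parser = ListLinkCollector()
    parser.feed(html)
    return parser.links


# Hand-written sample standing in for the downloaded index page.
SAMPLE_HTML = """
<ul>
  <li><a href="/wiki/List_of_universities_in_Canada">Canada</a></li>
  <li><a href="/wiki/List_of_universities_in_Canada">Canada (duplicate)</a></li>
  <li><a href="/wiki/Help:Contents">Help</a></li>
  <li><a href="#cite_note-1">[1]</a></li>
</ul>
"""

links = extract_list_links(SAMPLE_HTML)
print(links)
```

Each collected URL would then be fetched and parsed separately, and as noted above the per-country pages will still need cleaning.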

MZ900605
8 - Asteroid

@clmc9601 

Thanks for the help but the result is still a reference.

Screenshot (260).png

MZ900605
8 - Asteroid

@HomesickSurfer Thanks for the help. Yes, it's working, but it's a lot to copy and paste, around 200 links. Is there any way around this?

HomesickSurfer
12 - Quasar

Hi @MZ900605 

 

Using @clmc9601's sample workflow to compile a list of countries, I've attached my portion as a batch macro.

See the attached package. It's not ideal because many of the university lists are not in table format the way Canada's is, so the outputs have schema issues.

It's a start, @MZ900605. Perhaps Wikipedia has an API to use instead...
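On the API idea: Wikipedia is served by MediaWiki, which exposes an Action API at `/w/api.php`. A rough Python sketch of asking it for the links on the index page is below. The helper names are hypothetical, the sample response is hand-written, and the response shape handled here (a `parse.links` list whose entries carry the title under `"*"` or `"title"`) is an assumption that should be checked against the API documentation. The network call is defined but not executed in the example.

```python
import json
from urllib.parse import urlencode
from urllib.request import Request, urlopen

API = "https://en.wikipedia.org/w/api.php"


def build_links_query(page):
    """Build an action=parse request asking for the links on one page."""
    params = {"action": "parse", "page": page, "prop": "links", "format": "json"}
    return API + "?" + urlencode(params)


def titles_from_response(payload):
    """Pull article titles (namespace 0) out of a parse response.

    Assumed shape: {"parse": {"links": [{"ns": 0, "*": "Title"}, ...]}};
    the "title" key is accepted too in case a newer format is returned.
    """
    links = payload.get("parse", {}).get("links", [])
    return [l.get("*") or l.get("title") for l in links if l.get("ns") == 0]


def fetch_link_titles(page):
    """Call the live API (not run below; needs network access)."""
    req = Request(
        build_links_query(page),
        headers={"User-Agent": "university-list-sketch/0.1 (example)"},
    )
    with urlopen(req, timeout=30) as resp:
        return titles_from_response(json.load(resp))


url = build_links_query("Lists_of_universities_and_colleges_by_country")

# Hand-written stand-in for an API response, used to show the parsing step.
SAMPLE = {
    "parse": {
        "title": "Lists of universities and colleges by country",
        "links": [
            {"ns": 0, "exists": "", "*": "List of universities in Canada"},
            {"ns": 12, "exists": "", "*": "Help:Contents"},
        ],
    }
}
titles = titles_from_response(SAMPLE)
print(url)
print(titles)
```

This only returns page titles, so the schema problem remains: pages that list universities as bullets rather than tables would still need their own parsing.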

 

DOWNLOAD EXAMPLE.PNG
