
SOLVED

html scraping multiple pages

Simon2902

Hi all

 

I looked through the community, and while there is certainly a lot of content on HTML/web scraping, I couldn't sort out my issue. Apologies if I missed a vital topic somewhere.

 

I am looking to scrape data from each url that this directory links to: https://www.scimagojr.com/journalrank.php

 


 

 

Specifically I need the "subject area" and "category" from each of the urls (e.g.: https://www.scimagojr.com/journalsearch.php?q=19434&tip=sid&clean=0).

 


 

The idea is to map out "area" and "category". Given that there are more than 30,000 entries, I'd have to go via multiple directory pages.

 

So far I have only managed to extract the individual page URLs from the first directory page, but I appreciate that I probably need a dynamic, batch-based workflow to go through all URLs on all pages. My basic starting-point workflow is attached. Note: I am not well versed in Python, hence I assumed that the macro route would be worth exploring.

 

Many thanks

2 REPLIES
KP_DML

If you use the "Download Data" link you can get the entire list without having to deal with pagination. The second column, SourceId, can be used to build the link in a formula:

"https://www.scimagojr.com/journalsearch.php?q=" + [SourceId] + "&tip=sid&clean=0"
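
For anyone who would rather prototype this outside Designer, the same formula can be sketched in Python. This is only a rough sketch: the semicolon delimiter and the three-column sample row are assumptions made for illustration (check the actual export before relying on them), and the helper names `build_url` and `urls_from_export` are hypothetical. The only things taken from the thread are the URL pattern and the fact that SourceId is the second column.

```python
import csv
import io

BASE_URL = "https://www.scimagojr.com/journalsearch.php"


def build_url(source_id):
    """Mirror the formula above: q=<SourceId>&tip=sid&clean=0."""
    return f"{BASE_URL}?q={str(source_id).strip()}&tip=sid&clean=0"


def urls_from_export(text, delimiter=";"):
    """Yield one journal URL per data row, reading SourceId from the second column.

    The delimiter default is an assumption about the download file.
    """
    reader = csv.reader(io.StringIO(text), delimiter=delimiter)
    next(reader, None)  # skip the header row
    for row in reader:
        # Ignore short or empty rows rather than building a broken URL.
        if len(row) > 1 and row[1].strip():
            yield build_url(row[1])


# Made-up sample in the assumed layout; 19434 is the id from the example URL in the question.
sample = "Rank;Sourceid;Title\n1;19434;Example Journal\n"
print(list(urls_from_export(sample)))
```

Reading the id by position rather than by header name keeps the sketch independent of how the column is actually spelled in the file.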

 

You can then hit each of those URLs with the Download tool to get the page and parse the information you need. You may want to do this in small batches, because 30,000 rapid-fire requests may be noticed by the host, which could then take defensive action such as blocking you.
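
The batching idea can be sketched in Python as follows. The batch size of 50 and the 5-second pause are arbitrary placeholder values, not recommendations from the host, and `batched` and `process_in_batches` are hypothetical helper names. The `fetch` argument is whatever function you use to download a single page; it is passed in so the throttling logic can be tried without touching the network.

```python
import time
from itertools import islice


def batched(items, size):
    """Yield successive lists of at most `size` items."""
    it = iter(items)
    while True:
        chunk = list(islice(it, size))
        if not chunk:
            return
        yield chunk


def process_in_batches(urls, fetch, batch_size=50, pause_seconds=5.0, sleep=time.sleep):
    """Call fetch(url) for every URL, pausing between batches.

    batch_size and pause_seconds are assumed values; tune them to stay polite.
    """
    results = {}
    for index, batch in enumerate(batched(urls, batch_size)):
        if index > 0:
            sleep(pause_seconds)  # wait before every batch except the first
        for url in batch:
            results[url] = fetch(url)
    return results
```

In a real run, `fetch` could wrap `urllib.request.urlopen` and return the page text, with the subject area and category parsed out afterwards.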

 

 

Simon2902

Thanks @KP_DML, that worked well. Much easier than I expected. I should have inspected the download file in more detail 🙂
