Alteryx Designer Desktop Discussions

Simon2902 · 02-04-2020

Hi all

I looked through the community and, while there is certainly a lot of content on HTML/web scraping, I couldn't sort out my issue. Apologies if I missed a vital topic somewhere.

I am looking to scrape data from each URL that this directory links to: https://www.scimagojr.com/journalrank.php

Specifically, I need the "subject area" and "category" from each of the URLs (e.g. https://www.scimagojr.com/journalsearch.php?q=19434&tip=sid&clean=0).

The idea is to map out "area" and "category". Given that there are more than 30,000 entries, I'd have to go via multiple directory pages.

So far I have only managed to extract the static individual page URLs from the first directory page, but I appreciate that I probably need a dynamic batch-based workflow to go through all URLs on all pages. My basic starting-point workflow is attached. Note: I am not well versed in Python, so I assumed the macro route would be worth exploring.

Many thanks

KP_DML · 02-04-2020

If you use the "Download Data" link you can get the entire list without having to deal with pagination. The second column, SourceId, can be used to build the link in a formula:

"https://www.scimagojr.com/journalsearch.php?q=" + [SourceId] + "&tip=sid&clean=0"
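For anyone who wants to see the same construction outside Designer, here is a minimal Python sketch. The helper name build_journal_url is made up for illustration; the query parameters are copied from the formula above.

```python
from urllib.parse import urlencode

BASE_URL = "https://www.scimagojr.com/journalsearch.php"


def build_journal_url(source_id):
    """Return the journal page URL for one SourceId (hypothetical helper)."""
    # str() covers the case where SourceId was read as a number, not text.
    params = {"q": str(source_id).strip(), "tip": "sid", "clean": "0"}
    return BASE_URL + "?" + urlencode(params)


print(build_journal_url(19434))
# prints https://www.scimagojr.com/journalsearch.php?q=19434&tip=sid&clean=0
```

The str() call matters for the same reason it can matter in Designer: if SourceId arrives as a numeric field rather than a string, the formula would likely need ToString([SourceId]) before the concatenation.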

You can then hit each of those URLs with the Download tool to get the page and parse the information you need. You may want to do this in small batches, as 30,000 rapid-fire requests may be noticed by the host, which could then take defensive action.
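The batching idea can be sketched in Python roughly as follows. This is only an illustration under stated assumptions: the function names are hypothetical, and the batch size of 50 and the 10-second pause are arbitrary placeholders, not limits published by the host. Parsing the "subject area" and "category" out of each page is not shown.

```python
import time
import urllib.request


def chunked(items, size):
    """Yield consecutive slices of at most `size` items."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


def fetch_page(url):
    """Download one page and return its HTML as text."""
    with urllib.request.urlopen(url, timeout=30) as response:
        return response.read().decode("utf-8", errors="replace")


def fetch_in_batches(urls, batch_size=50, pause_seconds=10.0,
                     fetch=fetch_page, sleep=time.sleep):
    """Fetch URLs batch by batch, pausing between batches.

    batch_size and pause_seconds are assumed values; tune them conservatively.
    """
    pages = {}
    batches = list(chunked(urls, batch_size))
    for index, batch in enumerate(batches):
        for url in batch:
            pages[url] = fetch(url)
        # No need to wait after the final batch.
        if index < len(batches) - 1:
            sleep(pause_seconds)
    return pages
```

Passing fetch and sleep as parameters makes it possible to try the batching logic with stand-in functions before sending any real requests.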

Simon2902 · 02-06-2020

Thanks @KP_DML, that worked well. Much easier than I expected - should have inspected the download file in more detail 🙂

html scraping multiple pages