SOLVED

Extracting URLs of all pages of a website.

Shrawan890
5 - Atom

I'm running into a problem I can't fix. My aim is to extract all the URLs from a website called Transfermarkt (link 1), which lists information about football transfers on specific dates that you can specify. Currently, in Alteryx, I can extract all the transfers on the first page of each day (I plan to extract URLs from 16/02/2021 to the current date). What I can't figure out is how to extract the URLs beyond the first page for dates that span more than one page.

All the links have the same format; the only parts that differ from link to link are the date and the page number.
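
To illustrate the pattern, here's a rough Python sketch of how the first-page link for each date could be generated (the equivalent of my Generate Rows step). The `datum` and `page` segment names are placeholders, since the full link below is truncated; the real URL format should be taken from the site itself.

```python
from datetime import date, timedelta

# Hypothetical base URL; the real link below is truncated, so
# substitute the exact format from the browser's address bar.
BASE_URL = ("https://www.transfermarkt.co.uk/transfers/"
            "transfertagedetail/statistik/top/land_id_zu/0/land_id_ab/0")

def first_page_urls(start: date, end: date):
    """Yield the page-1 URL for every day from start to end inclusive."""
    day = start
    while day <= end:
        # 'datum' and 'page' are assumed segment names, for illustration only.
        yield f"{BASE_URL}/datum/{day.isoformat()}/page/1"
        day += timedelta(days=1)

# e.g. every first-page URL from 16/02/2021 to today
for url in first_page_urls(date(2021, 2, 16), date.today()):
    print(url)
```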

My next step, once I have the complete set of URLs, is to run them through a batch macro (which I've already created) that cleans and parses the scraped data and outputs it in tabular format, with the transfer details for each player.

I've attached an Alteryx workflow which currently extracts the URLs of the first page for the date range specified.

Link 1: https://www.transfermarkt.co.uk/transfers/transfertagedetail/statistik/top/land_id_zu/0/land_id_ab/0...

I've been learning Alteryx for two weeks now, so any pointers would be much appreciated.

2 REPLIES
DataNath
17 - Castor

From a quick inspection of the site, it looks like you can tell what the max page is for each date: it appears under 'Go to the last page (Page X)'. You could therefore parse this number out and once again use Generate Rows, like you did for your dates, with the condition that your page number <= the last page. This is going to be a pretty meaty run and you'd be sending an awful lot of requests to the site, so I'd rate limit your calls pretty generously and be patient with the run (assuming they don't mind being scraped - I've not checked).

(To be able to get this and parse it out, you'd need to do an initial call for each date > parse out the max page > generate all of your links > feed them back into the Download tool.)
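
If it's easier to prototype that two-step idea outside Alteryx first (call page 1, read the max page, then build the remaining page URLs), it might look roughly like this in Python. Fair warning: the 'Go to the last page (Page X)' text and the `/page/N` URL segment are assumptions from a quick glance, so check them against the actual site.

```python
import re
import time
import requests

HEADERS = {"User-Agent": "Mozilla/5.0"}  # some sites reject the default agent

def all_page_urls(first_page_url: str, delay: float = 2.0) -> list[str]:
    """Fetch page 1 for a date, read the max page, and build every page URL."""
    html = requests.get(first_page_url, headers=HEADERS, timeout=30).text
    # Assumes the last-page link reads 'Go to the last page (Page X)',
    # as observed above; adjust the pattern if the markup differs.
    match = re.search(r"Go to the last page \(Page (\d+)\)", html)
    last_page = int(match.group(1)) if match else 1
    time.sleep(delay)  # rate limit generously, per the advice above
    # Assumes the page number is the trailing path segment, e.g. '.../page/1'.
    return [re.sub(r"/page/\d+$", f"/page/{n}", first_page_url)
            for n in range(1, last_page + 1)]
```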

[Attached image: DataNath_1-1652996807545.png]

Shrawan890
5 - Atom

Thank you Nath!
