Extracting URLs of all pages of a website
I'm running into a problem which I can't fix. My aim is to extract all the URLs from a website called Transfermarkt (link 1). It's a website which contains information about football transfers on specific dates, which you can specify. Currently, in Alteryx, I can extract all the transfers on the first page of each day (I plan to extract URLs from 16/02/2021 to the current date). What I can't figure out is how to extract the URLs for pages beyond the first on dates that have more than one page.
All the links have the same format; the only differences between them are the date part and the page number.
My next step, once I have the complete URLs, is to run them through a batch macro (which I've already created) to clean and parse the transfer information scraped from each URL and output it in a tabular format with all the transfer information for each player.
I've attached an Alteryx workflow which currently extracts the first-page URLs for the date range specified.
I've been learning Alteryx for two weeks now, so any pointers would be much appreciated.
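To illustrate, the part I already have working is roughly equivalent to this minimal Python sketch (the URL pattern below is a placeholder, not the exact Transfermarkt format):

```python
from datetime import date, timedelta

# Hypothetical URL pattern: only the date and page number change between links.
BASE = "https://www.transfermarkt.com/<path>/datum/{d}/page/{p}"

def first_page_urls(start: date, end: date):
    """Yield the page-1 URL for every date in the range (like Generate Rows in Alteryx)."""
    day = start
    while day <= end:
        yield BASE.format(d=day.isoformat(), p=1)
        day += timedelta(days=1)

for url in first_page_urls(date(2021, 2, 16), date.today()):
    print(url)
```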
Labels: Iterative Macro
From a quick inspection of the site, it looks like you can tell what the max page is for each date - it appears under 'Go to the last page (Page X)'. Therefore, you could parse this number out and once again use Generate Rows - as you did for your dates - with the condition that your page number <= last page. This is going to be a pretty meaty run and you'd be sending an awful lot of requests to the site, so I'd rate limit your calls pretty generously and be patient with the run (assuming they don't mind being scraped - I've not checked).
(To get this and parse it out, you'd need an initial call for each date > parse out the max page > generate all of your links > feed those back into the Download tool.)
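Outside Alteryx, the same two-pass idea would look roughly like this rough Python sketch (the URL pattern and the last-page link text are assumptions you'd need to check against the real pages):

```python
import re
import time
import requests

# Hypothetical URL pattern - adjust to match the real Transfermarkt links.
BASE = "https://www.transfermarkt.com/<path>/datum/{d}/page/{p}"
HEADERS = {"User-Agent": "Mozilla/5.0"}  # many sites reject requests without a user agent

def max_page(date_str: str) -> int:
    """Fetch page 1 for a date and parse the highest page number, defaulting to 1."""
    html = requests.get(BASE.format(d=date_str, p=1), headers=HEADERS, timeout=30).text
    m = re.search(r"Go to the last page \(Page (\d+)\)", html)
    return int(m.group(1)) if m else 1

def all_page_urls(date_str: str):
    """One URL per page for the given date (Generate Rows: page <= last page)."""
    last = max_page(date_str)
    return [BASE.format(d=date_str, p=p) for p in range(1, last + 1)]

for url in all_page_urls("2021-02-16"):
    print(url)
    time.sleep(2)  # generous rate limit: be polite to the site
```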
Thank you Nath!
