SOLVED

Web Scraping Question

smoskowitz
12 - Quasar

How would I handle downloading multiple webpages? My example is a Monoprice page:

 

https://www.monoprice.com/search/index?keyword=graphics%20cards&pg=1

 

There are multiple pages; as you can see, the URL ends with pg=1. How would I iterate from pg=2 through whatever the last page is?

 

Thanks,

Seth

NickC
Alteryx Alumni (Retired)

Hello Seth,

 

 

To do this, you will need to create an iterative macro that increments the page number at the end of https://www.monoprice.com/search/index?keyword=graphics%20cards&pg= on each iteration. The iterative macro must be set to stop running when there are no more pages to download.
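Outside of Alteryx, the URL-incrementing part of that idea can be sketched in a few lines of Python. This is only an illustration, not the macro itself: the helper names `page_url` and `page_urls` are made up for this example, and it assumes the last page number is already known.

```python
from urllib.parse import quote, urlencode

BASE_URL = "https://www.monoprice.com/search/index"


def page_url(keyword: str, page: int) -> str:
    """Build the search URL for a single results page (hypothetical helper)."""
    # quote_via=quote encodes spaces as %20, matching the URL in the question.
    query = urlencode({"keyword": keyword, "pg": page}, quote_via=quote)
    return f"{BASE_URL}?{query}"


def page_urls(keyword: str, last_page: int) -> list:
    """Return the URLs for pages 1..last_page, assuming last_page is known."""
    return [page_url(keyword, n) for n in range(1, last_page + 1)]


if __name__ == "__main__":
    for url in page_urls("graphics cards", 3):
        print(url)
```

In practice the last page is usually not known up front, which is why the macro needs its own stop condition.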

 

Alternatively, the lazier way around this is to set the page to show 300 records, in which case you can just use this URL:

https://www.monoprice.com/search/index?keyword=graphics%20cards&PageSize=300

 

 

Be careful when web scraping third-party data, as the owner might not want it scraped.

 

Nick

smoskowitz
12 - Quasar

Thank you! This is a good start. As a new Alteryx user, I have not even begun to think about creating macros yet. At least I now know of a methodology to think about.

Joe_Mako
12 - Quasar

Attached is an example iterative macro that downloads a page and checks the paging div to see if it contains the "next.png" reference (this would need to be customized for each website). If it does, the macro loops back in and downloads the next page. I have also included a workflow that uses this macro; notice the Select tool that changes the data type before the data reaches the macro.
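For readers who want to see the same control flow outside of Alteryx, here is a rough Python sketch of the logic described above. It is not the attached macro. The `class="paging"` markup is an assumption about the page structure, and `has_next_page`, `download_all`, and the caller-supplied `fetch` function are hypothetical names used only for illustration.

```python
import re
from typing import Callable, List

# Assumed markup: a <div class="paging"> that holds the next-page image.
# The real page structure may differ and would need to be checked.
PAGING_DIV = re.compile(
    r'<div[^>]*class="[^"]*paging[^"]*"[^>]*>(.*?)</div>',
    re.IGNORECASE | re.DOTALL,
)


def has_next_page(html: str) -> bool:
    """Return True if the paging div references next.png."""
    match = PAGING_DIV.search(html)
    return bool(match) and "next.png" in match.group(1)


def download_all(fetch: Callable[[int], str], max_pages: int = 100) -> List[str]:
    """Fetch page 1, 2, ... until a page has no next.png in its paging div.

    fetch is supplied by the caller and returns the HTML for a page number.
    max_pages is a safety cap so a site change cannot cause an endless loop.
    """
    pages = []
    for page in range(1, max_pages + 1):
        html = fetch(page)
        pages.append(html)
        if not has_next_page(html):
            break
    return pages
```

Passing `fetch` in as a parameter keeps the loop independent of how pages are actually downloaded, much as the macro separates the Download tool from the loop test.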

 

[Attached image: download it.png]

gnans19
11 - Bolide

So you are relying on the "next.png" reference in the current page to decide whether there is a next page.

@Joe_Mako I like your approach. As always, I am a big fan of your solutions.

 

What if (worst case) there is no next button? Is there a way to keep the iterative macro running until the Download tool returns an error (cannot resolve the host name), and exit at that point?

Joe_Mako
12 - Quasar

This solution was built specifically for this webpage, using regex to pull out the div and relying on the fact that the image is only there when there is a next page.

 

If there is no next image, the macro will just exit as normal because there is no next page.

 

Or is your question about a different webpage that has no next button to begin with? Then yes, you can use similar logic (some things would need to be rewired): test whether the DownloadedData contains actual data. If it does, output that to the results and loop to the next page; otherwise, send no record to the loop output, which ends the macro. If you would like to see an example, please provide a webpage, as the one in this thread does not error and only has a next option when there is a next page.
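As a loose illustration of that alternative stop condition, the following Python sketch keeps requesting pages until one comes back with no data. The name `pages_until_empty` and the caller-supplied `fetch` function are hypothetical, and treating a blank response body as "no actual data" is an assumption; a real site might instead return a page that simply contains no results.

```python
from typing import Callable, Iterator


def pages_until_empty(fetch: Callable[[int], str], max_pages: int = 100) -> Iterator[str]:
    """Yield page bodies in order, stopping at the first empty one.

    fetch is supplied by the caller and returns the body for a page number.
    max_pages is a safety cap in case the site never returns an empty page.
    """
    for page in range(1, max_pages + 1):
        data = fetch(page)
        if not data.strip():
            # Nothing is yielded for this page, so the loop ends here,
            # like sending no record to the macro's loop output.
            return
        yield data
```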

gnans19
11 - Bolide

@Joe_Mako

 

If the webpage doesn't exist, the Download tool throws an error. I just tried with an invalid URL.

Joe_Mako
12 - Quasar

What URL are you attempting? I would expect that an invalid hostname like "!@#$%.!@#$%" makes the Download tool error like other tools in Alteryx, but that with a valid hostname the download headers return something like a 400 or 404. Please provide more details on what you are experiencing so it can be recreated, thank you!
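The same distinction between an unreachable host and a reachable host that answers with an error status can be seen in Python's standard library, which may help when reasoning about it. This is an analogy only and says nothing about how the Download tool is implemented; `describe_failure` and `check` are hypothetical helper names.

```python
from urllib.error import HTTPError, URLError
from urllib.request import urlopen


def describe_failure(exc: Exception) -> str:
    """Classify a urllib failure as an HTTP status or a connection problem."""
    # HTTPError is a subclass of URLError, so it must be tested first.
    if isinstance(exc, HTTPError):
        return f"host responded with HTTP {exc.code}"
    if isinstance(exc, URLError):
        return f"could not reach host: {exc.reason}"
    return f"unexpected error: {exc}"


def check(url: str) -> str:
    """Request a URL and report what happened (performs a network call)."""
    try:
        with urlopen(url, timeout=10) as response:
            return f"host responded with HTTP {response.status}"
    except (HTTPError, URLError) as exc:
        return describe_failure(exc)
```

A valid hostname with a bad path typically lands in the `HTTPError` branch, while a hostname that does not resolve lands in the `URLError` branch.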

gnans19
11 - Bolide

You are right. I was trying a random URL. If the hostname is valid, I get a valid error message from the host, e.g. http://google.com/xyz (instead of http://erthjkjasdfsdglsdfhgi.com/).

 

 

Thanks again for clarifying!
