Alteryx Designer Discussions

Webscraping data from a link that opens in another webpage

Roche
8 - Asteroid

Hi everyone, 

 

I have to webscrape data from the following website: 

https://appsource.microsoft.com/en-nl/marketplace/partner-dir?filter=sort%3D0%3BpageSize%3D18%3Bonly...

 


 

Looking at the page source, some of the partner details can be scraped from the current page. If you click on a partner, however, it takes you to another page, and that is the information I am looking to scrape. I do not see any link in the page source that leads to that page. Can someone please help me with this one?

 

Would appreciate your assistance!

 

Thank you, 

Rouche

6 REPLIES
markcurry
12 - Quasar

Hi @Roche 

 

I think the website uses this API call to get the list of Partners.  

 

https://main.prod.marketplacepartnerdirectory.azure.com/api/partners?filter=sort%3D0%3BpageSize%3D18...

 

Hope that helps

Roche
8 - Asteroid

Hi @markcurry 

 

Thank you, this certainly helps me.  

 

I would like to ask though, since I am not familiar with webscraping using an API directly: the link that you provided returns a list of 18 partners' information. How would I get the many other partners' information if this URL does not seem to return it?

 

Thank you.  Appreciate your help.

 

Rouche

markcurry
12 - Quasar

Hi @Roche 

 

It seems a tricky one. You'd think you're only seeing 18 partners because of the pageSize=18 parameter, but removing it still returns only 18 partners.

I had a look at the website in Chrome's developer tools. After it displays the first set of 18 results, clicking Next makes the API call include an additional 'pageOffset=18' parameter (then pageOffset=36, 54 and so on) to get the next 18 results.

 

https://main.prod.marketplacepartnerdirectory.azure.com/api/partners?filter=sort%3D0%3BpageSize%3D18...pageOffset%3D18%3BonlyThisCountry%3Dtrue%3Blat%3D41.383%3Blng%3D2.183%3Bcountry%3DES%3Bradius%3D1000%3Blocname%3DBarcelona%252C%2520Barcelona%252C%2520Spain
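
As a rough illustration of the paging pattern described above, here is a minimal Python sketch that generates one URL per page by stepping pageOffset in increments of 18. The helper names (build_filter, page_url) are hypothetical, the filter fields are only a subset of those visible in the URL above (the elided part is left out), and placing pageOffset at the end of the filter is an assumption, so treat this as a sketch rather than the exact request the website sends.

```python
from urllib.parse import quote

# Base endpoint as posted in the thread.
BASE_URL = "https://main.prod.marketplacepartnerdirectory.azure.com/api/partners"


def build_filter(fields: dict) -> str:
    """Join key=value pairs with ';', matching the decoded filter string."""
    return ";".join(f"{key}={value}" for key, value in fields.items())


def page_url(fields: dict, page_offset: int) -> str:
    """Return the URL for one page; pageOffset is omitted for the first page."""
    params = dict(fields)
    if page_offset > 0:
        # Assumption: the position of pageOffset inside the filter does not matter.
        params["pageOffset"] = page_offset
    # safe="" percent-encodes '=' and ';' as %3D and %3B, like the URLs above.
    return f"{BASE_URL}?filter={quote(build_filter(params), safe='')}"


# Only some of the fields visible in the thread; the real request carries more.
fields = {"sort": 0, "pageSize": 18, "onlyThisCountry": "true", "country": "ES"}

# Offsets 0, 18 and 36 correspond to the first three pages of 18 results.
urls = [page_url(fields, offset) for offset in range(0, 54, 18)]
for url in urls:
    print(url)
```

The same list of URLs could be produced in Alteryx with a Generate Rows tool feeding a Formula tool; the Python version is only meant to make the offset arithmetic explicit.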

 

I tried to download the additional pages with the attached workflow, but it just downloads the first 18 partners for every page, so I'm not sure how to get the Download tool to continue with the next set. I tried adding this User-Agent header to the Download tool, as @DavidP suggested here, to mimic a browser. I also tried adding references to the ai_session and ai_user cookies that the API uses, but that didn't work either.
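
For anyone who wants to reproduce this outside Alteryx, the following Python sketch sends a request with a browser-like User-Agent header and checks whether several downloaded pages are byte-for-byte identical, which is the symptom described above. The User-Agent value and the function names are illustrative assumptions, and nothing here is known to make the API return the later pages; it only helps confirm whether the responses repeat.

```python
import hashlib
from urllib.request import Request, urlopen

# A browser-like User-Agent string; the exact value is an illustrative assumption.
BROWSER_HEADERS = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64)"}


def build_request(url: str) -> Request:
    """Attach the browser-like headers to a GET request for one page URL."""
    return Request(url, headers=BROWSER_HEADERS)


def fetch_body(url: str, timeout: float = 30.0) -> bytes:
    """Download one page body. Not called here because it needs network access."""
    with urlopen(build_request(url), timeout=timeout) as response:
        return response.read()


def repeated_pages(bodies: list[bytes]) -> list[int]:
    """Return the indexes of pages whose body is identical to an earlier page."""
    seen: set[str] = set()
    repeats = []
    for index, body in enumerate(bodies):
        digest = hashlib.sha256(body).hexdigest()
        if digest in seen:
            repeats.append(index)
        seen.add(digest)
    return repeats
```

Calling repeated_pages on the bodies returned by fetch_body for each page URL gives a quick check: if every index after the first comes back, the server is ignoring the offset for these requests.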

 

@DavidP any ideas here?

 

Thanks

 

 

 

 

DavidSta
Alteryx

Just an idea so that we don't do the work twice: can we focus on one thread for this topic?

https://community.alteryx.com/t5/Alteryx-Designer-Discussions/Webscraping-data-that-does-not-provide...

Roche
8 - Asteroid

Hi @markcurry. Thank you for helping, I appreciate it. Since this looks like it will be somewhat of a struggle, I will put this project aside for a while, or perhaps not continue with it.

markcurry
12 - Quasar

@Roche, while not ideal, you could open the links from the workflow in a browser to display each page of API results, copy the results for each page into a text file, and then use Alteryx to process the text files. Hopefully someone else has a more elegant solution.
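
If the manual route is taken, the saved files could also be combined before they reach Alteryx. The sketch below assumes each page was saved as a .txt file containing one JSON response; the thread does not state the response format, so that is an assumption, and the function name combine_saved_pages and the folder layout are hypothetical.

```python
import json
from pathlib import Path


def combine_saved_pages(folder: str) -> list:
    """Parse every .txt file in the folder as JSON, in file-name order."""
    pages = []
    for path in sorted(Path(folder).glob("*.txt")):
        # Assumption: each file holds exactly one JSON document copied from the browser.
        payload = json.loads(path.read_text(encoding="utf-8"))
        pages.append({"file": path.name, "payload": payload})
    return pages
```

Naming the files so that they sort in page order (for example page_01.txt, page_02.txt) keeps the combined list in the same order as the pages were copied.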
