Webscraping data that does not provide a link needed for further data downloading

Roche
8 - Asteroid

Hi everyone, 

 

I have to webscrape data from the following website: 

https://appsource.microsoft.com/en-nl/marketplace/partner-dir?filter=sort%3D0%3BpageSize%3D18%3Bonly...

 

Roche_0-1655811567050.png

I need some data from the partner 'card' (partner name etc.), and the rest of the data only appears when you click on the card. It is this second set of data that I am trying to get, but looking at the page source I see no link that leads to it.

I would appreciate help on how to reach this data.

 

Thank you,

 

Rouche

DavidSta
Alteryx

Hi @Roche,

 

This is definitely not one of those websites that are fun to scrape.

As this site relies heavily on JavaScript, you need a web browser to interpret and execute it. The Download Tool is not able to do that.

So you could either parse the JavaScript yourself, which definitely wouldn't make it any more fun, or use some third-party code.

Selenium is a great framework here. It was originally designed for automated software testing, but you should be able to use it for your purpose as well.

Maybe you can check out this great post from @DavidM, which focuses on exactly this topic.

https://community.alteryx.com/t5/Alteryx-Designer-Discussions/Python-Code-Tool-Web-Scraping-Dynamic-...

Roche
8 - Asteroid

Hi @DavidSta 

 

Thank you, I have used this method before for other web scraping.

However, what link do I feed into the Python tool to scrape the page that opens up? The link is not in the code, at least not from what I can see. For every page that opens with further partner details, I need that link.

Previously, in the work I have done, the link was provided:

Roche_0-1656063846721.png

 

Also, someone else says that this link https://main.prod.marketplacepartnerdirectory.azure.com/api/partners?filter=sort%3D0%3BpageSize%3D18... is the API call for the data I need. It does give all the information I am looking for, but only the 18 partners on page 1 are listed. I need many more of the results. Do you have any advice on how to make the web scraper move to the next page, and the next, until all the pages have been scraped?

 

Thanks, 

 

Rouche

DavidSta
Alteryx

You would start from the main page and interact with it as you would in the browser, i.e. by clicking on the specific item.

This triggers some loading operations in JavaScript, such as loading the detail page.

With this the source code changes, and now you should be able to extract the information you are looking for. Finding the backend link in the source code is usually not easy.

The developer tools help here as well. With them you can extract the other link you shared.

There are GET parameters defined:

DavidSta_0-1656066423751.png

Here you can see that pageSize is set to 18. You can increase it, but only up to 20.

Another parameter that can be used in this query is "pageOffset", which lets you say "skip the first n entries". This can be used to iterate through the data.
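
As a rough sketch of how the paging URLs could be generated in Python: the code below builds one URL per page by stepping pageOffset in increments of pageSize. It assumes that pageOffset sits inside the same semicolon-separated filter string as sort and pageSize, which is not confirmed here. The helper names and the last offset of 90 in the demo are only illustrative, and the filter keys that are truncated in the link shared earlier are left out.

```python
from urllib.parse import urlencode

# Endpoint as shared in this thread; the rest of its filter string is truncated there.
BASE_URL = "https://main.prod.marketplacepartnerdirectory.azure.com/api/partners"


def build_page_url(page_offset, page_size=18, sort=0, base_url=BASE_URL):
    """Return the URL for one page of results."""
    # Assumption: pageOffset is one more key=value pair inside the filter string.
    filter_value = f"sort={sort};pageSize={page_size};pageOffset={page_offset}"
    # urlencode percent-encodes "=" as %3D and ";" as %3B, matching the shared link.
    return f"{base_url}?{urlencode({'filter': filter_value})}"


def build_page_urls(last_offset, page_size=18):
    """Return one URL per page, stepping the offset by the page size."""
    offsets = range(0, last_offset + 1, page_size)
    return [build_page_url(offset, page_size) for offset in offsets]


if __name__ == "__main__":
    # Illustrative upper bound; a real run would need to discover it.
    for url in build_page_urls(last_offset=90):
        print(url)
```

Each generated URL could then be fed to whatever performs the download, one row per page.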

But it seems the API is not very stable, so the same query can return different results.

Even just refreshing the webpage gives you different results, and changing the sort order option does not make it return the same values either. The output is "randomized" every time.

So you need to query it multiple times to have a good likelihood of capturing everything.
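
One way to sketch "query it multiple times" in Python is to repeat the whole paging sweep and keep one record per partner. Here `fetch_all_pages` is a hypothetical callable that returns the records of one complete sweep, and the `id` key used for de-duplication is an assumption; the real response may identify partners differently.

```python
def collect_unique(fetch_all_pages, passes=5, key="id"):
    """Run the full sweep several times and keep one record per key value."""
    seen = {}
    for _ in range(passes):
        for record in fetch_all_pages():
            # Keep the first copy of each partner; later duplicates are ignored.
            seen.setdefault(record[key], record)
    return list(seen.values())


if __name__ == "__main__":
    # Two fake sweeps that overlap, standing in for the randomized API output.
    samples = iter([
        [{"id": 1, "name": "A"}, {"id": 2, "name": "B"}],
        [{"id": 2, "name": "B"}, {"id": 3, "name": "C"}],
    ])
    partners = collect_unique(lambda: next(samples), passes=2)
    print(len(partners))  # 3
```

A fixed number of passes cannot guarantee completeness with a randomized source; it only raises the chance of having seen every partner.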

 

Looks like Microsoft doesn't want to share their data in an automated way.

Roche
8 - Asteroid

Hi @DavidSta 

 

I have done some webscraping and one API call before.  I am fairly new to APIs.  Do you suggest that I run this as an API call with a loop built into the flow?

DavidSta
Alteryx

In theory it could look like this (Please find the Workflow attached).

DavidSta_0-1656081985500.png

 

You generate a list of the many pages you want to parse. If you are lucky and have a good API (not this one), it tells you how many pages or entries there will be.

By testing I found that PageOffset 90 shows the last page, at least in the browser, as there are only 10 entries left and not the requested 18.

DavidSta_1-1656082096366.png

But you could create an Iterative Macro that extracts the totalCount and checks whether it equals 18. If it does, run the next iteration; otherwise stop.
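
Outside of a macro, the same stop condition could be sketched in Python roughly as follows. `fetch_page` is a hypothetical callable that takes an offset and returns the parsed JSON for that page. The `items` key is an assumed name for the list of partners, and `totalCount` is used the way it is described above, as the value compared against the page size. The `max_pages` guard is an added safety net in case the condition never triggers.

```python
def fetch_until_short_page(fetch_page, page_size=18, max_pages=1000):
    """Collect pages until one reports a count different from page_size."""
    records = []
    offset = 0
    for _ in range(max_pages):
        page = fetch_page(offset)
        records.extend(page["items"])  # "items" is an assumed key name
        if page["totalCount"] != page_size:
            break  # short page: nothing left to fetch
        offset += page_size
    return records


if __name__ == "__main__":
    # Fake data source with 100 entries standing in for the real API.
    data = [{"n": i} for i in range(100)]

    def fake_fetch(offset):
        chunk = data[offset:offset + 18]
        return {"items": chunk, "totalCount": len(chunk)}

    print(len(fetch_until_short_page(fake_fetch)))  # 100
```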

 

Now we come to the bad part.

As mentioned, Microsoft is not very happy about sharing its data, and I have run out of ideas for solving this directly in Alteryx without additional code such as Python in combination with Selenium.

When checking PageOffset = 90 in Alteryx I still get 18 results back. Even adding all of the header information my browser sends when communicating with the web server does not solve this issue.

I hope this brings you closer to your solution, but please don't expect that someone in this community will write the Python code for you.

Roche
8 - Asteroid

Hi @DavidSta , thank you for helping me out so much!  For now I will continue working on my other work and put this on the side for a while.  Not sure how the code can be written.  It will require research.

djehuty94
5 - Atom

Hey @Roche, did you find a workaround? I am also looking to extract the exact same data!

 
