If you do web scraping with Alteryx like I do, you have probably noticed that it has its limits, especially when you hit a website whose content is rendered with JavaScript. Alteryx is very good at extracting data from HTML code and parsing out the relevant information, but once you hit a website that relies on JavaScript to build its content, you have a roadblock.
What do you do now?
One solution is to use the Alteryx Python tool together with Selenium to scrape the website. This can work very well, but it of course requires Python coding skills. If you don't have those skills (like myself), then what?
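For those who do want to go the Python route, the Selenium approach can be sketched roughly like this. This is only a sketch, not the contents of any workflow in this article: the function name and the CSS selector argument are illustrative, and it assumes you have Selenium and a local Chrome/chromedriver installed.

```python
def fetch_rendered_titles(url, css_selector):
    """Open `url` in a headless Chrome browser, let the JavaScript
    render the page, and return the text of all matching elements."""
    # Imported inside the function so the sketch can be read without
    # Selenium installed; install it with `pip install selenium`.
    from selenium import webdriver
    from selenium.webdriver.common.by import By

    options = webdriver.ChromeOptions()
    options.add_argument("--headless")  # no visible browser window
    driver = webdriver.Chrome(options=options)
    try:
        driver.get(url)
        elements = driver.find_elements(By.CSS_SELECTOR, css_selector)
        return [element.text for element in elements]
    finally:
        driver.quit()  # always close the browser, even on errors
```

Inside the Alteryx Python tool you would then put the returned list into a pandas DataFrame and write it out to the workflow.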
Parsehub could be the solution!
Parsehub is a tool, available as a free version, that lets you click through a website and specify what you want to collect from it. Here is what it looks like:
Using the Parsehub help pages and a few articles, you should be able to create a project to scrape a website, and the free version already gets you quite far (it supports, for example, up to 200 pages per project).
In my example above, I am scraping the Alteryx Public Gallery for the available Apps/Macros/Workflows, their creators, and the URL of each. The last section, where it says "Click each...", is a nice piece of functionality inside Parsehub that lets you move to the next page of a list (for example by clicking the "Next" arrow on the Public Gallery home view).
Now, the interesting thing about Parsehub is that even the free version exposes a REST API that you can use both to start a project and to collect its results as JSON. This is where Alteryx strongly comes into play: you can build a workflow that (scheduled on an Alteryx Server) automatically starts a project on specific days of the week or month, checks the status of the job, and, once it has finished, collects the results back into Designer.
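To make the API side concrete, the "start a project" call can be sketched in a few lines of Python, using only the standard library. The endpoints follow the ParseHub API (v2 at the time of writing); the helper names are my own, and in the Alteryx workflows attached below the same calls are made with the Download tool instead.

```python
import json
import urllib.parse
import urllib.request

# ParseHub REST API root (v2 at the time of writing).
API_ROOT = "https://www.parsehub.com/api/v2"

def list_projects_request(api_key):
    # GET /api/v2/projects lists the projects in your account.
    return f"{API_ROOT}/projects", {"api_key": api_key}

def start_run_request(api_key, project_token):
    # POST /api/v2/projects/{token}/run starts a new run of a project.
    return f"{API_ROOT}/projects/{project_token}/run", {"api_key": api_key}

def start_project(api_key, project_token):
    """Start a run and return its run_token, which is needed later
    to poll the job status and to download the finished data."""
    url, params = start_run_request(api_key, project_token)
    body = urllib.parse.urlencode(params).encode("utf-8")
    with urllib.request.urlopen(url, data=body) as resp:  # data= makes it a POST
        return json.loads(resp.read().decode("utf-8"))["run_token"]
```

Your API key takes the place of any login; keep it out of shared workflows.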
I have attached several Apps/Workflows to this article; please see here:
1) The Alteryx Package "Pulling_of_Parsehub_Projects+Starting_Specific_Project.yxzp" contains 2 Alteryx Analytical Applications that are chained together. Please unpack them and run "01_Pulling_of_Parsehub_Projects.yxwz" as an Analytical App.
The App first asks you to enter your Parsehub API key, which you can find in the account information on Parsehub. Click Finish and it runs automatically; the next App then starts and asks you to select the Parsehub project you want to start.
2) Secondly, I included a workflow called "04_Check_Parsehub_Project_Status+Download_Completed_Data.yxmd" that uses a macro to check whether the job on Parsehub has finished; if it has, the workflow downloads the results from Parsehub and parses out the JSON.
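The check-and-download step that this workflow performs can be sketched like this, again against the documented ParseHub endpoints. The list name "tools" and all helper names are just illustrative, and note that the data endpoint may serve its JSON gzip-compressed, which the Download tool in Alteryx handles for you.

```python
import json
import time
import urllib.parse
import urllib.request

API_ROOT = "https://www.parsehub.com/api/v2"

def _get_json(url, params):
    # Append the query string and decode the JSON response body.
    query = urllib.parse.urlencode(params)
    with urllib.request.urlopen(f"{url}?{query}") as resp:
        return json.loads(resp.read().decode("utf-8"))

def wait_for_run(api_key, run_token, poll_seconds=60):
    """Poll GET /api/v2/runs/{run_token} until the run reports
    status 'complete', then return the run description."""
    while True:
        run = _get_json(f"{API_ROOT}/runs/{run_token}", {"api_key": api_key})
        if run.get("status") == "complete":
            return run
        time.sleep(poll_seconds)  # be polite; don't hammer the API

def download_results(api_key, run_token):
    # GET /api/v2/runs/{run_token}/data returns the scraped rows as JSON.
    return _get_json(f"{API_ROOT}/runs/{run_token}/data", {"api_key": api_key})

def flatten(results, list_name):
    """Turn one JSON list from the results into flat row dicts,
    ready to hand over as a table (e.g. to a pandas DataFrame)."""
    return [dict(row) for row in results.get(list_name, [])]
```

The `flatten` step is the Python equivalent of the JSON parsing the workflow does after the download.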
I hope you can use this for future web scraping in case the Download tool inside Alteryx is not sufficient on its own.
In my example I used it to scrape the Alteryx Public Gallery, which relies on JavaScript to list all the available tools. I can now re-run this every other week to see what new tools have been added, and never again miss an interesting tool I did not even know existed. You can see an excerpt of the result here: