Hello,
Looking for some help finding the full (or maybe absolute) URLs to do some web scraping.
Here is the basic URL link
https://apps.web.maine.gov/cgi-bin/online/bablo/licensing/search_large
I referenced Web Scrapping - Unchanged URL
but cannot find what I should be looking for in the developer tools. I ran a blank search and then went to Network tab but I don't see what I should be looking for. The other post mentioned: ?searchstring, but I don't see that.
Specifically I am interested in "BRW, DIS, SMB, SMD" so it could be 4 different URLs or leave the field blank and I can filter the data later. I would prefer to filter on the front end however.
Once I run the search I am trying to pull data from two levels down.
This is the first level: I want to select the link that brings me to the business name information.
Here is the second level and the fields I am trying to pull data for.
I do get a URL for the individual business pages (for example, Sea Dog Brewing Co.), but there is no way to predict the URL, especially if you don't know the company name.
Any help would be great. I've been on Alteryx for about a week now, so I do not know much about web scraping or APIs yet.
Hi @ibesmond ,
I got carried away and developed everything you need, I think 🙂
I'm attaching a solution where I've commented all the way to help you understand what is happening each step of the workflow.
I've used a lot of things here and I'm going to share some links to help you understand the steps.
First, I needed to check how the search button works, so I used the HTTP Trace extension from Google to simulate it.
From here, I know that it is a POST method with license_number and submit as payload.
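For readers outside Alteryx, the same request can be sketched in Python. This is only an illustration of the POST described above, not the attached workflow: the field names `license_number` and `submit` come from the trace, while the values `"BRW"` and `"Search"` and the helper name `build_search_request` are assumptions. The request is built but not sent.

```python
from urllib.parse import urlencode
from urllib.request import Request

SEARCH_URL = "https://apps.web.maine.gov/cgi-bin/online/bablo/licensing/search_large"


def build_search_request(license_type: str) -> Request:
    """Build (but do not send) the form POST for one license type."""
    # Field names are from the traced request; the values are assumptions.
    payload = {"license_number": license_type, "submit": "Search"}
    body = urlencode(payload).encode("ascii")
    return Request(
        SEARCH_URL,
        data=body,
        headers={"Content-Type": "application/x-www-form-urlencoded"},
        method="POST",
    )


req = build_search_request("BRW")
print(req.get_method(), req.full_url)
print(req.data.decode("ascii"))
```

Passing the request to `urllib.request.urlopen(req)` would actually submit it; the form values should be checked against your own trace first.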
After getting to the results page, I inspected the HTML code to see whether I could find how to get to the next page with all the information. Luckily, it was possible.
From that, I needed an automatic way of getting all the license name links, and I used regex to do it.
How to: https://community.alteryx.com/t5/Alteryx-Knowledge-Base/Tool-Mastery-RegEx/ta-p/37689
Test Regex expressions: https://regex101.com/
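As a rough illustration of the link-extraction step, here is a small Python sketch. The sample markup, the pattern, and the function name `extract_links` are all hypothetical; the real results page will differ, so the pattern would need adjusting after inspecting the actual HTML.

```python
import re
from urllib.parse import urljoin

BASE_URL = "https://apps.web.maine.gov/cgi-bin/online/bablo/licensing/search_large"

# Hypothetical results markup, NOT copied from the real site.
SAMPLE_HTML = """
<table>
<tr><td><a href="/example/detail?id=1">Example Brewing Co.</a></td></tr>
<tr><td><a href="/example/detail?id=2">Sample Distillery</a></td></tr>
</table>
"""

# Capture the href value and the visible link text of each anchor tag.
LINK_PATTERN = re.compile(
    r'<a\s+[^>]*href="([^"]+)"[^>]*>(.*?)</a>',
    re.IGNORECASE | re.DOTALL,
)


def extract_links(html: str, base_url: str) -> list[tuple[str, str]]:
    """Return (link text, absolute URL) pairs for every anchor found."""
    return [
        (name.strip(), urljoin(base_url, href))
        for href, name in LINK_PATTERN.findall(html)
    ]


links = extract_links(SAMPLE_HTML, BASE_URL)
for name, url in links:
    print(name, url)
```

`urljoin` resolves relative links against the search page address, which gives the absolute URLs asked about in the original question.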
After that, I inspected the HTML source code one more time and was able to identify all the company information.
Regex all the way once again to separate all the patterns identified.
Best,
Fernando Vizcaino
Hi @fmvizcaino Thank you so much. Quadruple thumbs up from me and all those that will benefit from your solution. This will serve as a complete training guide.
One question I wanted to ask. You set up the input to filter for BRW? In your opinion, would you suggest I duplicate this entire workflow for each license type, or would you create multiple rows in the text input tools as shown below? When I run the workflow as shown, I get 724 results, which contain quadrupled duplicates. Manually, the searches return 181 total: (BRW-13, DIS-1 (NULL), SMB-139 & SMD-28)
I'm not sure what is causing the duplicates. Do you know how to fix this? Or would you recommend a better way to union the data, and at what point in the workflow could the data be joined? You have already given me so much. If I could just ask this last challenge; I truly appreciate it. Thanks Fernando.
Hi @ibesmond ,
My bad, I started with one option and then forgot to configure it properly for you to insert multiple lines.
You can use one of the options you showed but not both at the same time. The append tool is working as a line multiplier, so it would have the same results as if you were multiplying lines in your URL dataset.
I would suggest inserting all the filters in your text input tool and removing the license_number column from your URL dataset. After that, you need to reconfigure your append and download tools once again. Keep in mind that the payload must include the license_number column as a parameter.
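The line-multiplier effect can be reproduced outside Alteryx with a Cartesian product. The sketch below is only a model of the behaviour described above: the helper `append_fields` and the row layouts are hypothetical, and it assumes the URL dataset had four rows when the four filters were appended to it.

```python
from collections import Counter
from itertools import product

URL = "https://apps.web.maine.gov/cgi-bin/online/bablo/licensing/search_large"
LICENSE_TYPES = ["BRW", "DIS", "SMB", "SMD"]


def append_fields(targets: list[dict], sources: list[dict]) -> list[dict]:
    """Pair every target row with every source row (a Cartesian product)."""
    return [{**t, **s} for t, s in product(targets, sources)]


filters = [{"license_number": t} for t in LICENSE_TYPES]

# Assumed misconfiguration: four URL rows AND four filter rows.
url_rows_multiplied = [{"url": URL, "url_row": i} for i in range(4)]
duplicated = append_fields(url_rows_multiplied, filters)

# Suggested setup: a single URL row, with all filters in the text input.
fixed = append_fields([{"url": URL}], filters)

print(len(duplicated), len(fixed))
print(Counter(row["license_number"] for row in duplicated))
```

In this model each license type is requested four times in the multiplied case, which matches 724 results being four times the 181 found manually.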
I'm reattaching the solution with those configurations as well.
Best,
Fernando Vizcaino