Alteryx Designer Desktop Discussions

SOLVED

Webscraping from Gov.uk site

JamesGray
7 - Meteor

Hi All,

 

I am trying to carry out a web scraping task from this UK Govt website: https://www.applytosupply.digitalmarketplace.service.gov.uk/g-cloud/search.

 

In essence, I would like to build a database of the pages linked from the site, so that I can search each one for a keyword or phrase and filter the list down to those that are relevant to me.

 

For example all pages which contain the word SOC2.

 

Unfortunately, I have limited web scraping experience, but I believe this is an achievable task, so I would appreciate any advice.

6 REPLIES
caltang
17 - Castor

Hi @JamesGray,

 

Going through the website, what I'd recommend is to use a Text Input tool with the URL that you want (e.g. https://www.applytosupply.digitalmarketplace.service.gov.uk/g-cloud/search?page=3), then add a standard macro that updates the page= number up to the last page.

 

The Download tool can be connected to the Text Input, and it will then download the page data. You can then split the data into rows on a delimiter and filter for your keyword, SOC2. However, at first glance, I don't see SOC2 mentioned much in the first three pages. Can you provide more examples?

 

In addition, perhaps you can attach a sketch of the expected output? That would be most helpful.
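
For anyone who wants to see the idea outside of Alteryx, here is a minimal Python sketch of the two steps above: building one URL per results page, then splitting downloaded text into rows and keeping the rows that mention a keyword. The function names `page_urls` and `rows_containing` are made up for illustration, and the sketch deliberately does no downloading; it only shows the URL generation and the row filter.

```python
from urllib.parse import urlencode

# Search URL taken from the thread; the page= parameter is the one shown above.
BASE_URL = "https://www.applytosupply.digitalmarketplace.service.gov.uk/g-cloud/search"


def page_urls(last_page, first_page=1):
    """Build one search URL per results page, from first_page to last_page inclusive."""
    return [
        f"{BASE_URL}?{urlencode({'page': n})}"
        for n in range(first_page, last_page + 1)
    ]


def rows_containing(text, keyword, delimiter="\n"):
    """Split text into rows on the delimiter and keep rows mentioning the keyword.

    The match is case-insensitive; surrounding whitespace is trimmed from each row.
    """
    needle = keyword.lower()
    return [row.strip() for row in text.split(delimiter) if needle in row.lower()]
```

Calling `page_urls(3)` gives the URLs for pages 1 to 3, and `rows_containing(downloaded_text, "SOC2")` plays the role of the split-to-rows plus filter step.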

Calvin Tang
Alteryx ACE
https://www.linkedin.com/in/calvintangkw/
JamesGray
7 - Meteor

Hi @caltang,

 

My understanding is that your solution would provide the text seen on the search results pages.

 

I am looking to get the text from each individual linked page, for example https://www.applytosupply.digitalmarketplace.service.gov.uk/g-cloud/services/255816056586378.

As you can see from this example, there is a mention of "SOC 2" in the first bullet point under the Features subtitle.

 

As for the output, below is a rough example of what I am looking for, using the above link as a reference:

JamesGray_0-1675803383674.png

 

Hopefully this helps

 

As an aside, I tried to use the Text Input and Download tools for the webpage I linked above, but my "download data" column was empty. I'm not sure what I was doing wrong.

markcurry
12 - Quasar

Hi @JamesGray 

 

There are two parts to what you need to do. First, you need to get the links to the individual linked pages from the main page (and from each of the 1361 results pages). Then, once you have the individual page links, you need to read those pages, which you can do with a batch macro.

 

Alternatively, could you just search for SOC2 on the main page and process those results instead?

 

I've attached a solution which will hopefully do the trick for you, or at least get you going in the right direction. The main workflow reads the source of each of the results pages and gets the individual page URLs. Each URL gets passed to a batch macro which processes that page. I have the batch macro returning the URL, the page Title and Company, as well as where the page source mentions SOC2 or SOC 2.
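
To illustrate the two parts in code form, here is a rough Python sketch, not the attached workflow itself. It assumes the results page source contains relative links written as `href="/g-cloud/services/<digits>"`, which is a guess based on the service URL quoted earlier in the thread, so check the real page source before relying on it. `service_urls` and `mentions_soc2` are names invented for this example.

```python
import re
from urllib.parse import urljoin

SITE = "https://www.applytosupply.digitalmarketplace.service.gov.uk"

# Assumption: service links appear as double-quoted relative hrefs in the page source.
SERVICE_LINK = re.compile(r'href="(/g-cloud/services/\d+)"')

# Matches "SOC2" and "SOC 2" (one optional whitespace character), ignoring case.
SOC2 = re.compile(r"SOC\s?2", re.IGNORECASE)


def service_urls(search_page_html):
    """Return the absolute service page URLs found in one results page, without duplicates."""
    urls = []
    for path in SERVICE_LINK.findall(search_page_html):
        url = urljoin(SITE, path)
        if url not in urls:
            urls.append(url)
    return urls


def mentions_soc2(service_page_html):
    """Report whether the page source mentions SOC2 or SOC 2."""
    return bool(SOC2.search(service_page_html))
```

The first function corresponds to the main workflow (results page in, service URLs out) and the second to the check done per page inside the batch macro.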

 

I see you're looking to return the text of the website. This is a little trickier, as Alteryx is returning the HTML source code, and I'm not sure what text within each page you're looking to return, or whether each page has the same format.
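
As a rough illustration of the HTML-versus-text problem, the Python sketch below strips the tags from a page source and keeps only the text a reader would see, using the standard library's `html.parser`. It makes no assumption about the layout of the service pages; `TextExtractor` and `visible_text` are names made up for this example, and picking out specific sections (such as Features) would still depend on how each page is structured.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect text nodes, ignoring anything inside <script> or <style>."""

    SKIP = {"script", "style"}

    def __init__(self):
        super().__init__()
        self.parts = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep only non-empty text that is not inside a skipped element.
        if not self._skip_depth and data.strip():
            self.parts.append(data.strip())


def visible_text(html):
    """Return the text content of an HTML document, one text node per line."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return "\n".join(parser.parts)
```

Searching the result of `visible_text` rather than the raw source avoids matching keywords that only appear inside tags or scripts.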

 

The attached workflow just reads pages 1 to 5, so you'd need to change the 'Generate Rows' tool condition expression from 'Pages <= 5' to 'Pages <= 1361' to read all the pages. It will take a while to run with that many pages.

 

Hope that helps, and gets you most of the way there.

JamesGray
7 - Meteor

Hi @markcurry,

 

Your solution sounds just like what I am looking for. Unfortunately, my company is still running Alteryx 2020.4.5.12471, so I am unable to open it, even after clicking OK on the older-version prompt. Is there any way I would still be able to access it?

 

Thanks

markcurry
12 - Quasar

Hi @JamesGray, I don't think there's anything in the workflow or the macro that I sent that won't work on 2020.4. If you open the workflow and the macro file in Notepad, you can edit the second line:

<AlteryxDocument yxmdVer="2021.3">

 

Change it from 2021.3 to 2020.4 and it should work fine. Hopefully that does the trick for you.
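
If there are several files to change, the same edit can be scripted. Below is a minimal Python sketch that rewrites the `yxmdVer` attribute in the file's text; it is just the Notepad edit automated, with `set_yxmd_version` being a helper name invented here. It has only been thought through against the one line shown above, so keep a backup copy of the original file.

```python
import re

# Matches the opening element shown above and captures the text around the version value.
VERSION_ATTR = re.compile(r'(<AlteryxDocument\s+yxmdVer=")[^"]+(")')


def set_yxmd_version(xml_text, version):
    """Return xml_text with the first yxmdVer value replaced by the given version."""
    patched, count = VERSION_ATTR.subn(
        lambda m: m.group(1) + version + m.group(2), xml_text, count=1
    )
    if count == 0:
        raise ValueError("no <AlteryxDocument yxmdVer=...> element found")
    return patched
```

To use it, read the workflow or macro file as text, pass the contents through `set_yxmd_version(contents, "2020.4")`, and write the result back to the file.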

JamesGray
7 - Meteor

Hi @markcurry,

 

That worked great, thank you. The workflow is just the kind of thing I was looking for. Also, great idea on the search query with the term included there to reduce the results!

 

I am now looking to convert this into an analytic app. I would like to be able to enter several search terms so that I can have multiple search criteria, for example "SOC2, SOC 2, ISO14001", preferably with a column created dynamically for each search term as a flag showing whether it was found or not.

 

I thought of including a text input interface tool, with commas separating the terms, which would feed into a Text To Columns tool to create the search criteria. However, I am not sure how to then express this in the Formula tool as "X OR Y OR Z", particularly when the number of search terms varies, i.e. one time it is 1 term and another time it is 5.

 

I am also unsure how to include these interface tools in the main workflow so that their values are passed into the macro and it functions correctly.

 

Any thoughts on this would be great
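
To make the requirement concrete, here is a small Python sketch of the logic only, not of how to wire it up with interface tools. It splits a comma-separated string into terms, builds one found/not-found flag per term (however many there are), and combines them as "X OR Y OR Z". The names `parse_terms`, `term_flags` and `matches_any` are made up for this illustration, and matching is assumed to be case-insensitive.

```python
import re


def parse_terms(raw, delimiter=","):
    """Split the user's input on the delimiter, trimming spaces and dropping empty terms."""
    return [term.strip() for term in raw.split(delimiter) if term.strip()]


def term_flags(page_text, terms):
    """Return one flag per search term: True if the term occurs in the text."""
    return {
        term: bool(re.search(re.escape(term), page_text, re.IGNORECASE))
        for term in terms
    }


def matches_any(page_text, terms):
    """The "X OR Y OR Z" test: True if at least one term was found."""
    return any(term_flags(page_text, terms).values())
```

With the input "SOC2, SOC 2, ISO14001", `parse_terms` yields three terms, `term_flags` gives the three flag columns for a page, and `matches_any` gives the overall filter, so the same logic covers 1 term or 5.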
