
Alteryx Designer Desktop Discussions

SOLVED

Web scraping - Multiple Pages

Anudeep_Yalamuru
8 - Asteroid

Hi All,

 

I need to get the details from this webpage - "https://www.gov.ie/en/directory/category/495b8a-schools/?school_roll_number="

 

I need to extract the details for every school, like the ones shown on this page:

 

"https://www.gov.ie/en/directory/page/5pcuno-2aoze9-/"

 

Could someone here please help me get the details for all 4,000-odd schools?

 

Thanks,

Anudeep

8 REPLIES
Deano478
12 - Quasar

@Anudeep_Yalamuru If I were you, I would just use Python for this, as Alteryx has limited web scraping capabilities. You could use libraries like Selenium, bs4, Fake User Agent, requests, and logging. With Python you can also customise your outputs to some extent.
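A minimal sketch of what such a download loop could look like. To keep it runnable without extra installs it uses only the standard library (`urllib` in place of requests), and the User-Agent string and function names are made up for illustration, so treat it as a starting point rather than a finished scraper.

```python
import logging
import time
import urllib.request

logging.basicConfig(level=logging.INFO)
log = logging.getLogger("school_scraper")

# Hypothetical User-Agent string; replace it with one that identifies you.
HEADERS = {"User-Agent": "school-directory-scraper/0.1"}


def fetch_page(url, timeout=30):
    """Download one page and return its HTML as text."""
    request = urllib.request.Request(url, headers=HEADERS)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.read().decode("utf-8", errors="replace")


def fetch_all(urls, fetch=fetch_page, delay_seconds=1.0):
    """Fetch each URL in turn, logging failures instead of stopping."""
    pages = {}
    for url in urls:
        try:
            pages[url] = fetch(url)
            log.info("fetched %s", url)
        except Exception as exc:
            # One bad page should not end the whole run.
            log.warning("failed %s: %s", url, exc)
        # Pause between requests to avoid hammering the site.
        time.sleep(delay_seconds)
    return pages
```

Calling `fetch_all` with a list of URLs returns a dict that maps each successfully downloaded URL to its HTML. The `fetch` parameter is there so the loop can be tried out with a stand-in function before pointing it at the real site.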

Anudeep_Yalamuru
8 - Asteroid

@Deano478 Thanks for your answer. I'm not an expert in Python, which is one of the main reasons I asked here: I would like to solve this with Alteryx.

 

Regards,

Anudeep

Anudeep_Yalamuru
8 - Asteroid

@MikeLR @mceleavey @atcodedog05 @danilang @le_luu Please help and assist

mceleavey
17 - Castor

Hi @Anudeep_Yalamuru ,

 

Of course we can help with this, but you will need to tell us where you're stuck. If you need consultancy, you can reach out, but this is not the place to ask people to do your work in its entirety.

Here is a good beginner article on how to get started with web scraping by @le_luu :

https://community.alteryx.com/t5/Engine-Works/Web-Scraping-in-Alteryx/ba-p/1173075

Read through that and apply that to your workflow.

If you get stuck or have problems, let us know specifically what those issues are.

 

If you require further help feel free to DM me.

 

M.




Anudeep_Yalamuru
8 - Asteroid

@mceleavey Hi, I tried following le_luu's article, but I'm getting an error in the very first part. That is the main reason I reached out here. Please find the screenshot of the error.

 

Thanks,

Anudeep

mceleavey
17 - Castor

@Anudeep_Yalamuru 

 

I've put something together for you which will get you over that first hurdle.

 

scraping.png

This will allow you to download all records from all sub-urls (pages) and I've started the process for you:

schools.png

 

I would suggest reading up on regex and text parsing in general. That should get you where you need to be.

 

I hope this helps,

 

M.

 




Anudeep_Yalamuru
8 - Asteroid

Thanks @mceleavey, I appreciate it. I will work through this.

le_luu
7 - Meteor

id_each_School_link_2.png

id_each_School_link.png

 

I agree with @mceleavey. You should extract the data using Regex and then clean it. From the workflow that @mceleavey built, you can get the ID of each school link (see the attached images above). After getting each school's ID, concatenate it with the base link: https://www.gov.ie/en/directory/page/. Then create a batch macro that goes through each link and extracts the data. Finally, union all the rows and clean the data, and you will have the full dataset.
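For anyone who wants to see the ID and concatenation steps written out, here is a rough Python sketch of the same logic. It assumes the listing page links to each school with an href of the form `/en/directory/page/<id>/`, which is inferred from the example URL in the question and not checked against the live site. The sample HTML and the second ID are made up.

```python
import re

BASE_URL = "https://www.gov.ie/en/directory/page/"

# Assumed link shape, inferred from the example URL in the question.
SCHOOL_LINK = re.compile(r'href="/en/directory/page/([A-Za-z0-9-]+)/?"')


def extract_school_ids(listing_html):
    """Return unique school page IDs in order of first appearance."""
    seen = {}
    for school_id in SCHOOL_LINK.findall(listing_html):
        seen.setdefault(school_id, None)
    return list(seen)


def build_school_urls(school_ids):
    """Concatenate each ID with the base link to get the detail page URL."""
    return [f"{BASE_URL}{school_id}/" for school_id in school_ids]


# Made-up sample of a listing page; only the first ID comes from the thread.
sample = """
<a href="/en/directory/page/5pcuno-2aoze9-/">School A</a>
<a href="/en/directory/page/abc123-def456-/">School B</a>
<a href="/en/directory/page/5pcuno-2aoze9-/">School A again</a>
"""

for url in build_school_urls(extract_school_ids(sample)):
    print(url)
```

In Alteryx the equivalent would be a RegEx tool to pull out the ID and a Formula tool to prepend the base link before the batch macro loops over the resulting URLs.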

Good luck!
