Alteryx Designer Desktop Discussions

Anudeep_Yalamuru · 09-11-2024

Hi All,

I need to get the details from this webpage - "https://www.gov.ie/en/directory/category/495b8a-schools/?school_roll_number="

I need to extract all school details like in this page

"https://www.gov.ie/en/directory/page/5pcuno-2aoze9-/"

Could someone here please help me get the details of all 4,000-odd schools?

Thanks,

Anudeep

Deano478 · 09-11-2024

@Anudeep_Yalamuru If I were you, I would just use Python for this, as Alteryx has limited web scraping capabilities. You could use libraries like Selenium, bs4, Fake User Agent, requests, logging, etc. With Python you can also customise your outputs to some extent.
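For anyone taking the Python route, here is a minimal sketch of the first step: download a page and collect its links. It uses only the standard library (urllib and html.parser) in place of requests and bs4, so it runs without installing anything. The names fetch_html, LinkCollector and collect_links are made up for illustration, and the assumption that school links are plain <a href="..."> tags has not been checked against the live gov.ie page.

```python
from html.parser import HTMLParser
from urllib.request import Request, urlopen


def fetch_html(url, user_agent="Mozilla/5.0"):
    """Download one page and return its text.

    A User-Agent header is sent because some sites reject the default one.
    """
    request = Request(url, headers={"User-Agent": user_agent})
    with urlopen(request, timeout=30) as response:
        charset = response.headers.get_content_charset() or "utf-8"
        return response.read().decode(charset, errors="replace")


class LinkCollector(HTMLParser):
    """Collect (href, link text) pairs for every <a> tag that has an href."""

    def __init__(self):
        super().__init__()
        self.links = []
        self._href = None
        self._text = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        # Only keep text that sits inside an <a href=...> element.
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.links.append((self._href, "".join(self._text).strip()))
            self._href = None


def collect_links(html):
    """Return all (href, text) pairs found in an HTML string."""
    parser = LinkCollector()
    parser.feed(html)
    parser.close()
    return parser.links
```

Calling collect_links(fetch_html(url)) on a listing page would then give the raw links to filter down to the school pages. If the page builds its list with JavaScript, this will come back empty and a browser-driven tool such as Selenium would be needed instead.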

Anudeep_Yalamuru · 09-11-2024

@Deano478 Thanks for your answer. I'm not an expert in Python and that is one of the main reasons for asking it here, to solve by Alteryx.

Regards,

Anudeep

Anudeep_Yalamuru · 09-11-2024

@MikeLR @mceleavey @atcodedog05 @danilang @le_luu Please help and assist

mceleavey · 09-11-2024

Hi @Anudeep_Yalamuru ,

of course we can help with this, but you will need to tell us where you're stuck. If you need consultancy, you can reach out, but this is not the place to ask for people to do your work in its entirety.

Here is a good beginner article on how to get started with web scraping by @le_luu :

https://community.alteryx.com/t5/Engine-Works/Web-Scraping-in-Alteryx/ba-p/1173075

Read through that and apply that to your workflow.

If you get stuck or have problems, let us know specifically what those issues are.

If you require further help feel free to DM me.

M.

Anudeep_Yalamuru · 09-11-2024

@mceleavey Hi, I tried doing it as per Le Luu's article, but I'm getting an error in the very first part. That is the main reason I reached out here. Please find the screenshot of the error.

Thanks,

Anudeep

mceleavey · 09-11-2024

@Anudeep_Yalamuru

I've put something together for you which will get you over that first hurdle.

This will allow you to download all records from all sub-urls (pages) and I've started the process for you:

I would suggest reading up on regex and text parsing in general. That should get you where you need to be.

I hope this helps,

M.

Anudeep_Yalamuru · 09-11-2024

Thanks @mceleavey, I appreciate it. I will work through this.

le_luu · 09-11-2024

I agree with @mceleavey. You should try extracting the data with Regex and then cleaning it. From the workflow that @mceleavey built, you can get the ID of each school link (see the attached files above). After getting the ID of each school, concatenate it with the base link: https://www.gov.ie/en/directory/page/. Then create a batch macro to go through each link and extract the data. Finally, union all the rows and clean the data to get the full dataset.
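The same sequence (pull the IDs with a regex, concatenate them with the base link, loop over every link, union the rows) can be sketched outside Alteryx in Python. This is only an illustration, not the attached workflow: the link pattern is an assumption based on the example school URL earlier in the thread, and extract_ids, build_urls and scrape_all are hypothetical names. The fetch and parse steps are passed in as functions, so the loop plays the role of the batch macro and the returned list plays the role of the union.

```python
import re

BASE_URL = "https://www.gov.ie/en/directory/page/"

# Assumed link shape: href="/en/directory/page/<id>/", optionally with the full domain.
ID_PATTERN = re.compile(
    r'href="(?:https://www\.gov\.ie)?/en/directory/page/([^"/]+)/?"'
)


def extract_ids(listing_html):
    """Return school page IDs in order of first appearance, without duplicates."""
    ids = []
    for page_id in ID_PATTERN.findall(listing_html):
        if page_id not in ids:
            ids.append(page_id)
    return ids


def build_urls(ids):
    """Concatenate each ID with the base link."""
    return [f"{BASE_URL}{page_id}/" for page_id in ids]


def scrape_all(urls, fetch, parse):
    """Run fetch and parse for every URL and collect one row per school.

    fetch(url) returns the page HTML; parse(html) returns a dict of fields.
    """
    rows = []
    for url in urls:
        rows.append({"url": url, **parse(fetch(url))})
    return rows
```

For 4,000-odd pages it would be sensible to add a short pause between requests and to save results as you go, so that one failed page does not lose the whole run.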

Good luck!

Web scrapping - Multiple Pages