
Alteryx Designer Desktop Discussions

Find answers, ask questions, and share expertise about Alteryx Designer Desktop and Intelligence Suite.

Automation and scraping a website

Simon1187
9 - Comet

Hi there, @patrick_digan 

 

I have an Excel file with a column containing 4,000+ URLs, one per cell. I need Alteryx to open each URL, scrape some of the data from the page, and paste the results into Excel.

Then it should repeat the same steps for the next URL. Could you please help me with that?

 

 

Thanks, 

Simon

8 REPLIES
mceleavey
17 - Castor

Hi @Simon1187 ,

 

Yes, you need to build a batch macro to loop through each URL. This will return the HTML for every URL, one row per page. You then need to parse the information you want out of that HTML. With 4,000+ URLs, if the pages do not share the same HTML structure, you will need to build the parsing for each one individually.

However, if the structure is the same, you only need to build it once for all of them.
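For anyone who prefers to see the idea in code: the batch-macro loop above is, in plain Python terms, just "for each URL, fetch the HTML and keep it alongside its URL". This is only a sketch of the concept, not what Alteryx runs internally; the injectable `fetch` argument is there so the loop can be exercised without network access.

```python
import urllib.request


def fetch_all(urls, fetch=None):
    """Return a list of (url, html) pairs, one per input URL.

    `fetch` can be swapped out for testing; by default it performs
    a real HTTP GET, which is what the Download tool does per row.
    """
    if fetch is None:
        def fetch(url):
            with urllib.request.urlopen(url, timeout=30) as resp:
                return resp.read().decode("utf-8", errors="replace")
    return [(u, fetch(u)) for u in urls]
```

In Alteryx the same loop is the Download tool fed by your URL column; no macro code is needed unless you have to throttle or retry requests.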

If you care to share the URLs and let me know what you need I can get you started.

 

M.



Bulien

Simon1187
9 - Comet

Hi @mceleavey 

 

Thanks for your prompt response. I should clarify a little. When I said 4,000+ URLs, I meant that the first part of each URL is the same. I have attached an Excel file to give you some idea of the format. Unfortunately, as the website is internal, it is not accessible from outside; however, I have included a screenshot of the page.

[attached screenshot: 010.JPG]

 

The page has fixed labels (Profile views, Phone calls, etc.), but the numbers beside them are dynamic.

The goal is to read those numbers and save them in an Excel file. The data needs to be read once per day.

I'd appreciate it if you could give me further assistance with this case, please.

Thank you!

Simon

danilang
19 - Altair

Hi @Simon1187 

 

If possible, it's always better to go to the source data rather than scrape it from a web page. Since your websites are internal, I would suggest reaching out to the people who maintain them and working with them to get access to the data that's used to build the pages.

 

Another option is to try calling the web pages with different query options. Your current URLs look like this:

https://X.COM/112949/profile-analytics?from=2018-03-01&graph=bar&sort=monthly&to=2021-11-03

The graph=bar parameter specifies that you want to receive a bar graph. Maybe there's an option you can specify to return the data in table or CSV form. Here again, you'd have to reach out to the people who maintain the data to find out what's available.
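If the endpoint does support an alternative output, rewriting the query string across 4,000 URLs is mechanical. A minimal sketch, assuming only the URL shape shown above (whether any value other than graph=bar exists is something the site's maintainers would have to confirm):

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def with_param(url, key, value):
    """Return `url` with query parameter `key` set (or replaced) to `value`."""
    parts = urlsplit(url)
    query = dict(parse_qsl(parts.query))  # keeps the other parameters intact
    query[key] = value
    return urlunsplit(parts._replace(query=urlencode(query)))
```

Applied to the example URL, `with_param(url, "graph", "table")` would leave `from`, `sort`, and `to` untouched and only swap the graph value.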

 

 

Dan

mceleavey
17 - Castor

Good shout, Dan.



Bulien

Simon1187
9 - Comet

Hi @danilang 

 

Thanks for your reply. Yes, that would be the best option if it were possible. Unfortunately, that data is generated by an external group, and all our team can see is a page like the one in the screenshot.

 

Thank you!

Simon

mceleavey
17 - Castor

@Simon1187 ,

 

In this case, just feed the URLs into the Download tool.

[attached screenshot: mceleavey_0-1636805404151.png]

 

This will download all the HTML into the DownloadData column.

You then need to use RegEx etc. to parse out the bits you want.

 

Obviously, we can't help with that as we don't have access.

 

M.



Bulien

danilang
19 - Altair

Hi @Simon1187 

 

Data displayed by modern browsers is generally not the result of a single simple call. You may find that the graph is generated by JavaScript from raw data embedded in the HTML. There could also be call-backs, with the initial call returning the skeleton of the page, which contains further URLs used to retrieve the data.
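In the first case (raw data embedded in the HTML), the numbers often sit in a JSON blob inside a script tag, and you can pull them out without rendering any JavaScript. A hedged sketch; the variable name `chartData` is purely illustrative, and this naive pattern only handles a flat (non-nested) JSON object:

```python
import json
import re


def find_embedded_json(html, var="chartData"):
    """Look for `<var> = {...};` in the HTML and parse the JSON object.

    Works only for flat objects: the lazy match stops at the first '}',
    so nested braces would need a proper parser instead.
    """
    m = re.search(re.escape(var) + r"\s*=\s*(\{.*?\})\s*;", html, re.DOTALL)
    return json.loads(m.group(1)) if m else None
```

If this returns nothing, the data is probably fetched by a call-back, and the thing to look for in the HTML is the second URL rather than the numbers themselves.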

 

Check out this post. It's the first solved one in the suggestions at the top of this page. It contains various methods for scraping the data, ranging from fairly simple, if the data is actually embedded in the HTML returned by the initial call, to very clever, such as @cmcclellan's solution of finding a second URL in the HTML and calling that to retrieve the data.

 

In any case, you should reach out to the external group.  Maybe they've had this request before and have a simple solution. 

 

Good luck

 

Dan  

Simon1187
9 - Comet

Hi @mceleavey

 

Could you please help me with the parse of the file that I have attached? 

 

Thanks, 

 

Simon
