Hi there, @patrick_digan
I have an Excel file with a column containing 4000+ URLs, each in a different cell. I need to use Alteryx to open each URL in Chrome, scrape some of the data from the website, and paste it into Excel.
It should then repeat the same steps for the next URL. Could you please help me with that?
Thanks,
Simon
Hi @Simon1187 ,
Yes, you need to build a batch macro to loop through each URL. This will return the HTML for every URL, one per row. You then simply need to parse out the information you need. With 4000+ URLs, assuming those pages do not share the exact same HTML structure, you will need to build the parsing for each one individually.
However, if the structure is the same, then you only need to build it once for all of them.
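Outside Alteryx, the batch-macro idea can be sketched in Python. This is only an illustration of the loop, not the Alteryx implementation; the column names mirror the Download tool's output, and the fetch function is injectable so the loop itself needs no network access.

```python
# Sketch of the batch-macro idea: loop over the URLs, fetch the HTML
# for each, and keep one row per URL for later parsing.
from urllib.request import urlopen

def default_fetch(url):
    """Download one page; mirrors what the Download tool does per row."""
    with urlopen(url, timeout=30) as resp:
        return resp.read().decode("utf-8", errors="replace")

def download_all(urls, fetch=default_fetch):
    """Return one {URL, DownloadData} row per input URL."""
    return [{"URL": u, "DownloadData": fetch(u)} for u in urls]
```

In Alteryx, the Download tool plays the role of `fetch` and the batch macro plays the role of the loop.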
If you care to share the URLs and let me know what you need I can get you started.
M.
Hi @mceleavey
Thanks for your prompt response. I think I should clarify a little bit. When I said 4000+ URLs, I meant the first parts of the URLs are the same. I have attached an Excel file for you to give you some idea about that. Unfortunately, as the website is internal, it is not accessible from the outside. However, I have put a screenshot of the page.
The page has fixed labels (Profile views, Phone calls, etc.), but the numbers beside them are dynamic.
The goal is to read those values and save them in an Excel file. The data needs to be read once per day.
I would appreciate any further assistance with this, please.
Thank you!
Simon
Hi @Simon1187
If possible, it's always better to go for source data as opposed to scraping data from a web page. Since your websites are internal, I would suggest reaching out to the people who maintain them and working with them to get access to the data that's used to build the pages.
Another option is to try calling the web pages with different options. Your current URLs look like this:
https://X.COM/112949/profile-analytics?from=2018-03-01&graph=bar&sort=monthly&to=2021-11-03
The graph=bar part specifies that you want to receive a bar graph. Maybe there's an option you can specify to return the data in table or CSV form. Here again, you'll have to reach out to the people who maintain the data to find out what's available.
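As a sketch of trying different options, the query string can be rewritten programmatically. The example below uses Python's standard library; the replacement value ("table") is purely hypothetical, so check with the site maintainers which parameters the page actually supports.

```python
# Sketch: rewrite one query-string parameter while keeping the rest.
from urllib.parse import urlencode, urlparse, parse_qs, urlunparse

def with_params(url, **overrides):
    """Return url with the given query parameters replaced or added."""
    parts = urlparse(url)
    query = {k: v[0] for k, v in parse_qs(parts.query).items()}
    query.update(overrides)
    return urlunparse(parts._replace(query=urlencode(query)))

url = "https://X.COM/112949/profile-analytics?from=2018-03-01&graph=bar&sort=monthly&to=2021-11-03"
candidate = with_params(url, graph="table")  # hypothetical option
```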
Dan
Hi @danilang
Thanks for your reply. Yes, that would be the best option if it were possible. Unfortunately, that data is generated by an external group, and all our team can see is something like the page in the screenshot.
Thank you!
Simon
In this case, just feed the URLs into the Download tool.
This will download all the HTML into the DownloadData column.
You then need to use RegEx etc. to parse out the bits you want.
Obviously, we can't help with that as we don't have access.
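Since the page isn't accessible, here is only a generic sketch of that parsing step in Python. The labels come from the screenshot, but the HTML structure and the number format are assumptions, so the pattern will likely need adjusting against the real page.

```python
# Sketch of the "RegEx etc." parsing step: find the number that
# follows a fixed label such as "Profile views" in the raw HTML.
import re

def extract_metric(html, label):
    """Return the first number appearing after `label`, or None."""
    match = re.search(re.escape(label) + r"\D*([\d,]+)", html)
    return match.group(1).replace(",", "") if match else None
```

In Alteryx, the equivalent would be a RegEx tool in Parse mode applied to the DownloadData column.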
M.
Hi @Simon1187
Data displayed by modern browsers is generally not the result of a simple call. You may find that the graph is generated by JavaScript from raw data embedded in the HTML. There could also be call-backs, with the initial call returning the skeleton of the page, which contains subsequent URLs used to retrieve the data.
Check out this post. It's the first solved one in the suggestions at the top of this page. It contains various methods for trying to scrape the data. They range from fairly simple, if the data is actually embedded in the HTML returned by the initial call, to very clever, with @cmcclellan's solution of finding a second URL in the HTML and calling that to retrieve the data.
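For the simpler of those cases, where the raw data sits in the initial HTML as a JSON blob inside a script tag, the check can be sketched like this. The variable name "chartData" is entirely hypothetical; inspect the page source to find the real one.

```python
# Sketch: pull a JSON object assigned to a JavaScript variable
# out of the raw HTML and parse it.
import json
import re

def find_embedded_json(html, var_name="chartData"):
    """Return the JSON object assigned to var_name, or None."""
    match = re.search(re.escape(var_name) + r"\s*=\s*(\{.*?\})\s*;", html, re.DOTALL)
    return json.loads(match.group(1)) if match else None
```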
In any case, you should reach out to the external group. Maybe they've had this request before and have a simple solution.
Good luck
Dan
Hi @mceleavey,
Could you please help me with parsing the file that I have attached?
Thanks,
Simon