
Alteryx Designer Discussions

Find answers, ask questions, and share expertise about Alteryx Designer.

Data Scraping from Wikipedia

Meteor

Hello, I am a bit new to Alteryx and need help with data scraping. First, I want to get the table located on this

website: en.wikipedia.org/wiki/List_of_North_American_stadiums_by_capacity

Then, using the hyperlinked name of each stadium in that big table, I want to visit each stadium's wiki page and get its coordinates.

I thank you all in advance for extending help.

Alteryx Certified Partner
Depending on what the objective is, I might suggest a different approach.

If it's for learning, then sure, this can probably be achieved entirely within Alteryx.

If it's for an actual project, I'd probably grab the stadium list from the wiki with a good old copy and paste, then use one of the many geocoding apps available on the Alteryx Gallery to take the stadium names and return the corresponding latitude and longitude. If you search for "google maps" on the Alteryx Gallery, you will find a wide array of options.

Ben
Meteor

It's an actual project, so specific latitude and longitude values are required. I got the list of stadiums from Wikipedia. However, generating the lat/long for them is proving to be a challenge for me at this point.

Alteryx

Although you could do this without REGEX, it is your friend here!!

 

Tokenize (to rows) the DownloadData field with <table(.*?)</table>, then tokenize again (to rows) with <tr(.*?)</tr>, and finally tokenize (to columns) with <td(.*?)</td>
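
For anyone who wants to check the three patterns outside Designer, here is a rough Python sketch of the same table, then row, then cell tokenizing. The sample HTML and the name tokenize_table are made up for illustration; the regexes are the ones quoted above. Note that each capture group starts right after "<td", so it still carries the tail of the opening tag (at least the ">").

```python
import re

# Hypothetical sample standing in for the DownloadData field.
html = """
<table class="wikitable">
<tr><th>Rank</th><th>Stadium</th></tr>
<tr><td>1</td><td><a href="/wiki/Michigan_Stadium" title="Michigan Stadium">Michigan Stadium</a></td></tr>
<tr><td>2</td><td><a href="/wiki/Beaver_Stadium" title="Beaver Stadium">Beaver Stadium</a></td></tr>
</table>
"""

# DOTALL lets "." cross line breaks, since real pages span many lines.
FLAGS = re.DOTALL | re.IGNORECASE


def tokenize_table(page):
    """Return one list of raw cell strings per data row."""
    rows = []
    for table in re.findall(r"<table(.*?)</table>", page, FLAGS):
        for row in re.findall(r"<tr(.*?)</tr>", table, FLAGS):
            cells = re.findall(r"<td(.*?)</td>", row, FLAGS)
            if cells:  # header rows use <th>, so they yield no <td> cells
                rows.append(cells)
    return rows


rows = tokenize_table(html)
for cells in rows:
    print(cells)
```

The second element of each row is the cell that holds the stadium link.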

 

The second column will give you the links in their tags, so you will need to parse that column with <a href="(.*?)".*?>(.*?)</a>
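
As a quick sanity check of that pattern, a small Python sketch is below. The sample cell and the name parse_link are hypothetical; the regex is the one quoted above, with the first group capturing the href and the second the link text.

```python
import re

# The pattern from the post: group 1 is the href, group 2 is the link text.
LINK = re.compile(r'<a href="(.*?)".*?>(.*?)</a>', re.DOTALL)


def parse_link(cell):
    """Return (href, text) for the first link in a cell, or None if there is none."""
    match = LINK.search(cell)
    if match is None:
        return None
    return match.group(1), match.group(2)


# Hypothetical cell content, shaped like the output of the <td(.*?)</td> step.
cell = '><a href="/wiki/Michigan_Stadium" title="Michigan Stadium">Michigan Stadium</a>'
print(parse_link(cell))
```

Cells without a link return None, so they can be filtered out before building URLs.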

 

Use a Formula tool to construct the full URL from the parsed data, and then feed that into another Download Tool.
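
The URL step might look like this in Python. This is only a sketch: it assumes the parsed href values are site-relative paths such as /wiki/Michigan_Stadium and that https://en.wikipedia.org is the right base. The names BASE_URL, full_url, and fetch, and the User-Agent string, are made up for illustration.

```python
from urllib.parse import urljoin
from urllib.request import Request, urlopen

# Assumption: hrefs in the table are relative to the English Wikipedia host.
BASE_URL = "https://en.wikipedia.org"


def full_url(href):
    """Turn a relative href into an absolute URL; absolute hrefs pass through unchanged."""
    return urljoin(BASE_URL, href)


def fetch(url):
    """Rough equivalent of the second Download step: return the page HTML as text."""
    request = Request(url, headers={"User-Agent": "stadium-coordinates-example"})
    with urlopen(request, timeout=30) as response:
        return response.read().decode("utf-8")


print(full_url("/wiki/Michigan_Stadium"))
```

In Designer the Formula tool would do the same thing with a simple string concatenation of the base and the parsed href.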

 

The image below gives an idea of the process. I've collapsed the container, as it would otherwise be confusing. I just put together a quick set of tools to get the picture, so the workflow shown is not necessarily a working one.

 

Parse_Wiki_Table.png

 

You now have the HTML of all the stadium pages. I haven't looked at those, but look for the coordinates; REGEX is the easiest way to pull them out, and they should be fairly standard across the pages.
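
One possible pattern for the coordinates, sketched in Python. This assumes each stadium page carries a hidden span with class "geo" holding decimal "latitude; longitude"; that markup is an assumption and should be checked against the downloaded HTML before relying on it. The sample string and the name extract_coordinates are hypothetical.

```python
import re

# Assumption: the page contains <span class="geo">lat; lon</span> in decimal degrees.
GEO = re.compile(
    r'<span class="geo">\s*(-?\d+(?:\.\d+)?)\s*;\s*(-?\d+(?:\.\d+)?)\s*</span>'
)


def extract_coordinates(page):
    """Return (latitude, longitude) as floats, or None if no geo span is found."""
    match = GEO.search(page)
    if match is None:
        return None
    return float(match.group(1)), float(match.group(2))


# Hypothetical fragment shaped like the assumed markup.
sample = (
    '<span class="geo-dec">42.26583°N 83.74861°W</span>'
    '<span class="geo">42.26583; -83.74861</span>'
)
print(extract_coordinates(sample))
```

Pages that return None are worth inspecting by hand, since they may not list coordinates or may format them differently.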

 

Kane
