Data Scraping from Wikipedia


Hello, I am a bit new to Alteryx. I need help with data scraping. First, I wish to get the table located on


and then, using the hyperlinked stadium names in that big table, I want to visit each stadium's wiki page and get its co-ordinates.

Thank you all in advance for your help.

Depending on what the objective is here, I may suggest a different approach.

If it’s for learning, then sure, this can probably be achieved completely with Alteryx.

If it’s for an actual project, I’d probably consider grabbing the stadium list from the wiki with a good old copy and paste, and then using one of the many geocoding apps available on the Alteryx Gallery to perform the geocoding, taking the stadium names and returning the appropriate latitude and longitude. If you search for Google Maps on the Alteryx Gallery, there is a wide array of options available.


It's an actual project, so specific lat and long would be required. I got the list of stadiums from Wikipedia. However, generating lat/long for them seems to be a challenge for me at this point.


Although you could do this without REGEX, it is your friend here!!


Tokenize (split to rows) the DownloadData field with <table(.*?)</table>, then again (to rows) with <tr(.*?)</tr>, and then (to columns) with <td(.*?)</td>


The second column will give you the links in their tags, so you will need to parse that column with <a href="(.*?)".*?>(.*?)</a>
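
For anyone who wants to test these expressions outside Alteryx first, here is a minimal Python sketch of the same tokenize steps. The sample HTML, the function name extract_links and the stadium names are made up for illustration; the real DownloadData will be much messier, so treat this as a rough check of the patterns rather than a finished parser.

```python
import re

# Hypothetical, trimmed-down HTML standing in for the DownloadData field.
sample_html = """
<table class="wikitable">
<tr><th>No.</th><th>Stadium</th><th>Capacity</th></tr>
<tr><td>1</td><td><a href="/wiki/Stadium_A" title="Stadium A">Stadium A</a></td><td>50,000</td></tr>
<tr><td>2</td><td><a href="/wiki/Stadium_B" title="Stadium B">Stadium B</a></td><td>40,000</td></tr>
</table>
"""

def extract_links(html):
    """Return (href, link text) pairs from the second <td> of each table row."""
    links = []
    # Step 1: split into tables; DOTALL lets "." run across line breaks.
    for table in re.findall(r"<table(.*?)</table>", html, flags=re.DOTALL):
        # Step 2: split each table into rows.
        for row in re.findall(r"<tr(.*?)</tr>", table, flags=re.DOTALL):
            # Step 3: split each row into cells (columns).
            cells = re.findall(r"<td(.*?)</td>", row, flags=re.DOTALL)
            if len(cells) < 2:
                continue  # header rows use <th>, so they yield no <td> cells
            # Step 4: parse the link out of the second column.
            match = re.search(r'<a href="(.*?)".*?>(.*?)</a>', cells[1], flags=re.DOTALL)
            if match:
                links.append((match.group(1), match.group(2)))
    return links

stadium_links = extract_links(sample_html)
print(stadium_links)
```

The second column is assumed to hold the stadium link, as described above; if the real table puts it elsewhere, change the cells[1] index.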


Use a Formula tool to construct the full URL from the parsed data, and then feed that back into a Download tool.
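
As a rough illustration of the URL step, here is a short Python sketch. It assumes the parsed links are relative paths such as /wiki/Stadium_A and that the pages live on English Wikipedia; BASE_URL and build_full_url are names I made up, so adjust the base to wherever your table came from. In the Formula tool this amounts to joining the base address and the parsed href as strings.

```python
from urllib.parse import urljoin

# Assumption: the table came from English Wikipedia; change this if it did not.
BASE_URL = "https://en.wikipedia.org"

def build_full_url(href, base=BASE_URL):
    """Turn a relative href into an absolute URL; absolute hrefs pass through unchanged."""
    return urljoin(base, href)

print(build_full_url("/wiki/Stadium_A"))
```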


The image below gives an idea of the process. I've collapsed the container as it would be confusing... I just put together a quick set of tools to get the picture; it is not necessarily a working workflow.




You now have the HTML of all the stadium pages... I haven't looked at those, but look for the co-ordinates; REGEX is the easiest way to pull them out, and they should be pretty standard across the pages.
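
A possible starting point for that last REGEX, sketched in Python. It assumes each stadium page carries decimal co-ordinates in a span with class "geo" in the form "latitude; longitude"; please check that against the HTML you actually download, as I have not verified it for these pages. The sample snippet, its values and the name extract_coordinates are made up for illustration.

```python
import re

# Assumed markup: <span class="geo">51.556; -0.2796</span>
GEO_PATTERN = re.compile(
    r'<span class="geo">\s*(-?\d+(?:\.\d+)?)\s*;\s*(-?\d+(?:\.\d+)?)\s*</span>'
)

def extract_coordinates(page_html):
    """Return (latitude, longitude) as floats, or None if no match is found."""
    match = GEO_PATTERN.search(page_html)
    if match is None:
        return None
    return float(match.group(1)), float(match.group(2))

# Hypothetical fragment of a stadium page, with made-up values.
sample_page = (
    '<span class="geo-dec">51.556°N 0.2796°W</span>'
    '<span style="display:none"> / <span class="geo">51.556; -0.2796</span></span>'
)
coords = extract_coordinates(sample_page)
print(coords)
```

The same expression should drop into a RegEx tool in Parse mode, giving one output column per capture group. Pages without a match return nothing, so it is worth filtering those out and checking them by hand.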