community
cancel
Showing results for 
Search instead for 
Did you mean: 

Alteryx designer Discussions

Find answers, ask questions, and share expertise about Alteryx Designer.
SOLVED

Extract a table from Wikipedia

Meteor

Hi,

 

I tried to use the download tool to extract a table from a wikipedia page: https://en.wikipedia.org/wiki/ISO_3166-1_alpha-2

I'm trying to get the big table with country code and country names.

 

 

I've used the download tool to get the data, used Text to column tool with "Split to Rows" and \n as delimiter.

It basically gives me the html code of the page.

 

What's next to isolate that specific table?

This would take seconds in powerquery and easily refresh, it'd be great to have a user friendly tool to grab data from web...

 

Thanks

 

Tibo

Bolide

Hey Tibo,

 

In situations like this, I like to build a custom API via import.io and use it with the download tool in Alteryx. This is free, and can capture data from very complex webpages with little effort. 

 

Cheers,

Ryan

Alteryx Certified Partner
Alteryx Certified Partner

This isn't perfect but it should get you started.

 

Using XML tool, dynamic select, and transpose you can pull the XML data for each specific row of the table you are after.

ACE Emeritus
ACE Emeritus

The attached solution utilizes observed specifics in the HTML to extract the relevant rows generate the Abbr/CountryName list in straight-forward fashion.  Granted, if they tweak the table, results may go bad, but that may well be the case with any solution.

 

(But as noted above, import.io is probably the way to go here: it captures every column in the table).

Try using import.io to set up a parse process.

 

This then presents a JSON API endpoint you can use in Alteryx much more easily

 

For this page I tried the magic.import.io and it worked staright away. https://magic.import.io/?site=http:%2F%2Fen.wikipedia.org%2Fwiki%2FISO_3166-1_alpha-2

 

You need to create an account and save the API. After this go to Export and Simple API integration.

 

For this page my generated API was: https://api.import.io/store/connector/b53ea8c8-6851-4645-8522-bc3f2db005d2/_query?input=webpage/url:...

 

I set up a download tool to grab the JSON and then parse with standard tools.

 

Attached a sample

Double post :(

Alteryx
Alteryx

Hi Tibo,

 

There's a few different ways of Parsing the html table listed here, so I thought that I would include mine. The base of it is around using REGEX to parse each element to a new line (I used to replace each element with a pipe (eg Replace([DownloadData],"<table","|") and then parse on the pipe before I got used to REGEX).

 

I'll also jump on the love for Import.io, as once the page gets a little more complicated than a simple table, Import.io handles those elements better.

 

Kane

Meteor

Thanks Guys,

 

This is helpful, I'm familiar with import.io so will use that as it is the easiest here.

I just can't wait for such functionality to be embedded in a proper Alteryx tool because all the parsing methodologies outlined here are way too long for me, I'd be quicker copying and pasting the table in a text input and refresh it manually :-)

 

Anyway this gives me ideas for other stuff so thanks for spending the time

 

Tibo

 

Labels