Community Spring Cleaning week is here! Join your fellow Maveryx in digging through your old posts and marking comments on them as solved. Learn more here!

Alteryx Designer Desktop Discussions

Find answers, ask questions, and share expertise about Alteryx Designer Desktop and Intelligence Suite.
SOLVED

Extract a table from Wikipedia

Tibo
7 - Meteor

Hi,

 

I tried to use the download tool to extract a table from a wikipedia page: https://en.wikipedia.org/wiki/ISO_3166-1_alpha-2

I'm trying to get the big table with country code and country names.

 

 

I've used the download tool to get the data, used Text to column tool with "Split to Rows" and \n as delimiter.

It basically gives me the html code of the page.

 

What's next to isolate that specific table?

This would take seconds in powerquery and easily refresh, it'd be great to have a user friendly tool to grab data from web...

 

Thanks

 

Tibo

8 REPLIES 8
rdoptis
11 - Bolide

Hey Tibo,

 

In situations like this, I like to build a custom API via import.io and use it with the download tool in Alteryx. This is free, and can capture data from very complex webpages with little effort. 

 

Cheers,

Ryan

michael_treadwell
ACE Emeritus
ACE Emeritus

This isn't perfect but it should get you started.

 

Using XML tool, dynamic select, and transpose you can pull the XML data for each specific row of the table you are after.

JohnJPS
15 - Aurora

The attached solution utilizes observed specifics in the HTML to extract the relevant rows generate the Abbr/CountryName list in straight-forward fashion.  Granted, if they tweak the table, results may go bad, but that may well be the case with any solution.

 

(But as noted above, import.io is probably the way to go here: it captures every column in the table).

jdunkerley79
ACE Emeritus
ACE Emeritus

Try using import.io to set up a parse process.

 

This then presents a JSON API endpoint you can use in Alteryx much more easily

 

For this page I tried the magic.import.io and it worked staright away. https://magic.import.io/?site=http:%2F%2Fen.wikipedia.org%2Fwiki%2FISO_3166-1_alpha-2

 

You need to create an account and save the API. After this go to Export and Simple API integration.

 

For this page my generated API was: https://api.import.io/store/connector/b53ea8c8-6851-4645-8522-bc3f2db005d2/_query?input=webpage/url:...

 

I set up a download tool to grab the JSON and then parse with standard tools.

 

Attached a sample

jdunkerley79
ACE Emeritus
ACE Emeritus

Double post :(

KaneG
Alteryx Alumni (Retired)

Hi Tibo,

 

There's a few different ways of Parsing the html table listed here, so I thought that I would include mine. The base of it is around using REGEX to parse each element to a new line (I used to replace each element with a pipe (eg Replace([DownloadData],"<table","|") and then parse on the pipe before I got used to REGEX).

 

I'll also jump on the love for Import.io, as once the page gets a little more complicated than a simple table, Import.io handles those elements better.

 

Kane

Tibo
7 - Meteor

Thanks Guys,

 

This is helpful, I'm familiar with import.io so will use that as it is the easiest here.

I just can't wait for such functionality to be embedded in a proper Alteryx tool because all the parsing methodologies outlined here are way too long for me, I'd be quicker copying and pasting the table in a text input and refresh it manually :-)

 

Anyway this gives me ideas for other stuff so thanks for spending the time

 

Tibo

 

cjones514
5 - Atom

Is it true that the API functionality of import.io is now a paid feature?

 

 

Labels