Hi,
I tried to use the download tool to extract a table from a wikipedia page: https://en.wikipedia.org/wiki/ISO_3166-1_alpha-2
I'm trying to get the big table with country code and country names.
I've used the download tool to get the data, used Text to column tool with "Split to Rows" and \n as delimiter.
It basically gives me the html code of the page.
What's next to isolate that specific table?
This would take seconds in powerquery and easily refresh, it'd be great to have a user friendly tool to grab data from web...
Thanks
Tibo
Solved! Go to Solution.
Hey Tibo,
In situations like this, I like to build a custom API via import.io and use it with the download tool in Alteryx. This is free, and can capture data from very complex webpages with little effort.
Cheers,
Ryan
The attached solution utilizes observed specifics in the HTML to extract the relevant rows generate the Abbr/CountryName list in straight-forward fashion. Granted, if they tweak the table, results may go bad, but that may well be the case with any solution.
(But as noted above, import.io is probably the way to go here: it captures every column in the table).
Try using import.io to set up a parse process.
This then presents a JSON API endpoint you can use in Alteryx much more easily
For this page I tried the magic.import.io and it worked staright away. https://magic.import.io/?site=http:%2F%2Fen.wikipedia.org%2Fwiki%2FISO_3166-1_alpha-2
You need to create an account and save the API. After this go to Export and Simple API integration.
For this page my generated API was: https://api.import.io/store/connector/b53ea8c8-6851-4645-8522-bc3f2db005d2/_query?input=webpage/url:...
I set up a download tool to grab the JSON and then parse with standard tools.
Attached a sample
Double post :(
Hi Tibo,
There's a few different ways of Parsing the html table listed here, so I thought that I would include mine. The base of it is around using REGEX to parse each element to a new line (I used to replace each element with a pipe (eg Replace([DownloadData],"<table","|") and then parse on the pipe before I got used to REGEX).
I'll also jump on the love for Import.io, as once the page gets a little more complicated than a simple table, Import.io handles those elements better.
Kane
Thanks Guys,
This is helpful, I'm familiar with import.io so will use that as it is the easiest here.
I just can't wait for such functionality to be embedded in a proper Alteryx tool because all the parsing methodologies outlined here are way too long for me, I'd be quicker copying and pasting the table in a text input and refresh it manually :-)
Anyway this gives me ideas for other stuff so thanks for spending the time
Tibo
Is it true that the API functionality of import.io is now a paid feature?