
SOLVED

Web Scraping -- Country and Country Code

smoskowitz
12 - Quasar

Hello --

 

I am trying to scrape some country/country code information from the following site: https://www.irs.gov/e-file-providers/foreign-country-code-listing-for-modernized-e-file

 

I can get the country using the regex tool, but can't seem to figure out how to get the next piece. Below is what I am trying to get:

 

2017-10-11_9-47-28.jpg

 

Here is what I have done so far:

 

2017-10-11_9-50-21.jpg

I have no regex skills, so this is just from Googling around. Let me know what I am missing. I should end up with about 258 rows of data.

 

Thanks,

Seth

4 REPLIES
Kenda
16 - Nebula

Hey @smoskowitz! I created a small workflow that may be able to help you out. I first split the DownloadData field into rows based on new lines. I then used a Formula tool to parse out the parts of the field we wanted to keep, followed by a Multi-Row Formula tool and a Filter tool to get the country code next to the corresponding country. Hope this helps!
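
For readers who cannot open the attached workflow, the same steps can be roughly sketched in Python. This is only an illustration under assumptions: the sample HTML, the `class="c"` attribute, and the function name `parse_country_codes` are hypothetical stand-ins, and the real page markup and the attached workflow may differ.

```python
import re

# Hypothetical stand-in for the DownloadData field; the real page markup may differ.
download_data = """<tr>
<td class="c">Afghanistan</td>
<td class="c">AF</td>
</tr>
<tr>
<td class="c">Albania</td>
<td class="c">AL</td>
</tr>"""


def parse_country_codes(html):
    """Return (country, code) pairs from HTML shaped like the sample above."""
    # Step 1: split the downloaded text into one row per line.
    lines = html.splitlines()

    # Step 2: keep only the text between "> and < (same pattern as in the thread).
    # Lines that do not match the pattern (e.g. <tr>) are dropped.
    values = []
    for line in lines:
        match = re.fullmatch(r'(.*">)(.*)(<.*)', line)
        if match:
            values.append(match.group(2))

    # Steps 3 and 4: put each code next to the country on the row before it,
    # then keep only the rows that start a country/code pair.
    return [(values[i], values[i + 1]) for i in range(0, len(values) - 1, 2)]


pairs = parse_country_codes(download_data)
```

The pairing step assumes that country and code always alternate line by line, which is a property of the sample data here, not something confirmed for the actual IRS page.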

GavinAttard
11 - Bolide

Hi @smoskowitz

 

Quick and crude but attached should do the trick

 

cheers

 

Gavin

Alteryx Everything, Leave no one behind.
smoskowitz
12 - Quasar

Thank you! What exactly is this doing:

 

REGEX_Replace([DownloadData], '(.*">)(.*)(<.*)', "$2")

Kenda
16 - Nebula

@smoskowitz I'm not sure how familiar you are with REGEX_Replace, but it has three required parameters: the field name, the pattern you're looking for, and the replacement value. Here, it basically says: look for the text between "> and < and keep only that.

 

To be more specific, the pattern splits your field into three groups, using parentheses to mark each one. The first group is everything up to (and including) the ">. The second group is everything after the "> and before the <. The third group is everything after (and including) the <. The $2 then tells Alteryx to keep only the second group. Hopefully that makes sense!
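
The grouping described above can be tried outside Alteryx with a minimal Python sketch. The sample strings and the helper name `keep_between` are hypothetical; Python's `re.sub` writes the second group as `\2` where the Alteryx expression uses `$2`.

```python
import re

# Same pattern as in the Alteryx expression: three groups, keep only the second.
PATTERN = r'(.*">)(.*)(<.*)'


def keep_between(text):
    """Replace the whole match with group 2, i.e. the text between "> and <."""
    return re.sub(PATTERN, r'\2', text)


# Hypothetical sample line; the real DownloadData rows may look different.
sample = '<td class="c">Afghanistan</td>'
result = keep_between(sample)

# Show the three groups individually.
groups = re.fullmatch(PATTERN, sample).groups()
```

Here `groups` comes out as `('<td class="c">', 'Afghanistan', '</td>')` and `result` as `'Afghanistan'`. Because `.*` is greedy, a line containing several `">` sequences puts everything up to the last one into the first group. In this Python version, a line that does not match the pattern is returned unchanged.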
