Challenge #54: Data Prep Address Parsing

GeneR

The link to last week’s challenge (exercise #53) is HERE.

This week’s challenge is to parse out City, State and ZIP code from some unformatted input data.

The data is in a nonstandard format: it is missing commas, and some city names are one word while others are two, which makes parsing a challenge. You need to parse out the city name, state, and ZIP code (if available).

Your goal is to create a process that will transform the data into a data table with separated columns for City, State, and ZIP.

Enjoy and as always I look forward to seeing some creative solutions.

challenge_54_start_file.yxmd

challenge_54_solution.yxmd

mceleavey

Seeing as we are Alteryxing here, I decided to go a little further and "repair" the gaps in the data, basically because it's easy in Alteryx, so why not?

Spoiler

I decided to feed the existing address components into the open Google API and return the full results.
First, I replaced the spaces in the Address Text with + signs and appended the result to the API URL. I then used a formula to create the search string and gave each row an ID:

Prepare initial search string for download.PNG
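
For readers following along outside Alteryx, the search-string step could look roughly like the Python sketch below. The endpoint, the `address` and `key` parameter names, the sample rows, and the helper name `build_search_url` are all assumptions for illustration; the post only says "the open Google API" and the workflow needs your own key.

```python
from urllib.parse import quote_plus

# Assumed endpoint; the post does not name the exact URL it used.
BASE_URL = "https://maps.googleapis.com/maps/api/geocode/json"


def build_search_url(address_text: str, api_key: str) -> str:
    """Replace spaces with '+' and append the address to the API URL."""
    return f"{BASE_URL}?address={quote_plus(address_text)}&key={api_key}"


# Hypothetical input rows, not the challenge data.
addresses = [
    "123 Main Street Denver CO 80202",
    "45 Oak Drive Fort Collins CO",
]

# Give each row an ID so the responses can be tied back to their source row.
requests_to_send = [
    {"id": row_id, "url": build_search_url(text, "YOUR_API_KEY")}
    for row_id, text in enumerate(addresses, start=1)
]
```

The sketch only builds the strings; it does not call the API.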

This string was then fed into the Download tool, which brought back the full address information. I then used the RegEx tool to parse each search string to rows, then Text To Columns to separate the address components:

Download and parse.PNG

It was then a simple task of cleaning up the data, writing a quick formula to take the short version of the State and removing any unwanted columns:

Cleanup and sorting.PNG

I then used Crosstab to put the data back into rows and renamed the columns:

Re-arrange accordingly and rename.PNG

This is the output:

I've attached the workflow but you will need to input your own API key.

Challenge 54 Solution.yxmd

GeneR

Nice!

philip_ball

Hi,

So my license doesn't have access to the address parsing tools, so I took a more manual approach.

Spoiler

ex54 flow.JPG

Basically 3 Regex parsers 'power' it. The Zip and State are simple enough and follow an easy-to-parse form (i.e., 2 capital letters, then 5 digits at the end of the line).

In order to split the City name from the Street, I treated the street type (e.g., Drive, Circle, St, etc.) as a delimiter between the two. To turn it into a single delimiter, I created a 'collection' of street types and Find/Replaced them with '|' symbols in the original lines. I then separated out the street names fully, and then parsed out the City name with a regex.
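
A rough Python equivalent of the idea above, as a sketch only: the street-type list, the sample addresses in the usage note, and the names `STREET_RE`, `TAIL_RE` and `parse_line` are made up for illustration and are not taken from the attached workflow.

```python
import re

# Hypothetical 'collection' of street types; a real one would be longer.
STREET_TYPES = ["Street", "St", "Drive", "Dr", "Circle", "Road", "Rd", "Avenue", "Ave"]
STREET_RE = re.compile(r"\b(?:" + "|".join(STREET_TYPES) + r")\b")

# Two capital letters, optionally followed by five digits, at the end of the line.
TAIL_RE = re.compile(r"\s+([A-Z]{2})(?:\s+(\d{5}))?\s*$")


def parse_line(line):
    tail = TAIL_RE.search(line)
    if tail is None:
        return {"City": None, "State": None, "ZIP": None}
    state, zip_code = tail.group(1), tail.group(2)
    head = line[: tail.start()]
    # Turn the first street type into a single '|' delimiter,
    # then keep whatever follows it as the city.
    marked = STREET_RE.sub("|", head, count=1)
    _street, _sep, city = marked.partition("|")
    return {"City": city.strip(), "State": state, "ZIP": zip_code}
```

With these assumptions, `parse_line("45 Oak Drive Fort Collins CO")` gives City "Fort Collins", State "CO" and no ZIP. A city that itself contains a street word (for example one starting with "St") would need extra handling.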

challenge_54_start_file_complete.yxmd

TomWelgemoed

Great stuff Phil for doing this without address inputs.

andrewdatakim

Hi Everyone,

I built my model based on receiving the information in the two formats provided (w/ and w/o zip codes), which run through a parse tool to separate out the state and zip code.

Address Parse Workflow

RegEx Address Parse

The next step was to divide the records into those with and without zip codes. The ones with zip codes can be joined to a zip code repository, giving you all of the information you are looking for with the correct formatting. Those without zip codes need to be cross-referenced using an Advanced Join (this tool allows you to specify multiple criteria for a cross join: https://gallery.alteryx.com/?_ga=1.45648100.198713632.1460495285#!app/Advanced-Join/547f8df96ac90f0f2ca5e439).

Advance Join Logic
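
The two branches can be sketched in Python as follows. The repository rows are invented, and the join criteria for the no-zip branch (same state, and the address text ends with the city name, ignoring case) are an assumption; the post does not spell out the conditions used in the Advanced Join.

```python
# Hypothetical zip code repository: (zip, city, state).
ZIP_REPOSITORY = [
    ("80202", "Denver", "CO"),
    ("80521", "Fort Collins", "CO"),
    ("10001", "New York", "NY"),
]
BY_ZIP = {zip_code: (city, state) for zip_code, city, state in ZIP_REPOSITORY}


def lookup(address, state, zip_code):
    """address is the text left after state and zip have been parsed off."""
    if zip_code is not None:
        # Records with a zip code: a plain join on the zip.
        hit = BY_ZIP.get(zip_code)
        if hit is None:
            return None
        return {"City": hit[0], "State": hit[1], "ZIP": zip_code}
    # Records without a zip code: a multi-condition join (assumed criteria).
    for _zip, city, repo_state in ZIP_REPOSITORY:
        if repo_state == state and address.lower().endswith(city.lower()):
            return {"City": city, "State": repo_state, "ZIP": None}
    return None
```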

The next set of formula tools removes the cities from the addresses, leaving the Selects and the Union for cleanup. One further improvement would be to run the records without a zip code through the Google API to fill the zip code in. At that point, any supplied zip codes that are invalid or missing from your repository could also be corrected and stored.

The major weakness of this workflow is the initial parse, which can currently only process the two formats. I would want to create a repository for the most common submissions and ensure that it stays case insensitive.

Address Parse from Raw String.yxzp

NicoleJohnson

My solution. I got a bit stuck (clearly, I need a crash course in RegEx, and don't have access to the address parse tools either), so I ended up deciding to create a text input of common street names (Street, St, Drive, Dr, etc.), which allowed me to split out the number & street after using the Find & Replace tool. From there, string formulas let me do the rest. For this to work for any address, I'd likely need to expand the common street names text input to include additional ones (Place, Close, Court, etc.). It's not elegant, and I'm not convinced yet that it would work 100% of the time... but it worked.

Spoiler

challenge_54_NicoleJohnson.yxmd

SeanAdams

solution attached. Bit bashful about this one 'cause it's just not pretty (in fact - went back and re-did it once I checked the solution afterwards :-))

Love Mark's idea (@mceleavey) to use the Google API to do some cleanup.

Spoiler

Did a series of formulas to replace the street
Then found state and zip in regex pairs
- parse out the zip - then replace it in the original text
- parse out the state - then replace it in the original text

as you can see in the screenshot below - felt a bit silly afterwards so I went back and redid it and included that too :-)
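
The "parse it out, then remove it from the text" steps listed above might look like this in Python. This is a sketch under assumptions: the street words, the patterns, and the helper names `peel` and `parse_line` are invented for illustration and are not taken from the attached workflow.

```python
import re


def peel(text, pattern):
    """Return (captured value or None, text with the whole match removed)."""
    m = re.search(pattern, text)
    if m is None:
        return None, text
    return m.group(1), (text[: m.start()] + text[m.end():]).strip()


def parse_line(line, street_words=("Street", "Drive")):  # hypothetical street words
    rest = line
    # Step 1: drop everything up to and including the street word.
    for word in street_words:
        _head, sep, tail = rest.partition(f" {word} ")
        if sep:
            rest = tail
            break
    # Step 2: parse out the zip, then remove it from the text.
    zip_code, rest = peel(rest, r"\s(\d{5})$")
    # Step 3: parse out the state, then remove it from the text.
    state, rest = peel(rest, r"\s([A-Z]{2})$")
    # Whatever is left is the city.
    return {"City": rest, "State": state, "ZIP": zip_code}
```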

challenge_54_SeanSolution.yxmd

estherb47

Had some challenges with the RegEx when I was trying to do it with one RegEx parse tool (could not get it to select from the numbers through the Street/Road/Ave/etc.). Still managed to clean this with 3 tools!

Spoiler

First RegEx tool replaces the Street/St/Avenue/Ave/Road/Rd/Circle/Drive/Dr etc, with a +

Then used a formula with RegEx_Replace to remove all of the characters preceding and including that +

Next, a RegEx tool with the Parse method grabs all of the text before the two uppercase letters as the city, the 2 uppercase letters as the state, and everything afterwards as zip. I used \s outside of the marked groups to remove extra spaces from the results.
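
A minimal Python sketch of these three steps, assuming a shortened street-type list and patterns written for this example (the actual expressions are in the attached workflow and may differ):

```python
import re

# Assumed street types; the workflow's own list may be longer.
STREET_TYPES = r"\b(?:Street|St|Avenue|Ave|Road|Rd|Circle|Drive|Dr)\b"


def parse_line(line):
    # Tool 1: replace the street type with a '+'.
    marked = re.sub(STREET_TYPES, "+", line)
    # Tool 2: remove everything up to and including that '+'.
    tail = re.sub(r"^.*\+", "", marked)
    # Tool 3: city, two uppercase letters as state, the rest as zip.
    # The \s tokens sit outside the groups so extra spaces are dropped.
    m = re.match(r"\s*(.+?)\s+([A-Z]{2})\b\s*(.*)$", tail)
    return m.groups() if m else None
```

On a made-up row such as "45 Oak Drive Fort Collins CO", this returns the city "Fort Collins", the state "CO" and an empty zip.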

challenge_54_EHB_solution.yxmd

LordNeilLord

This was most definitely not a beginner exercise!

Spoiler

challenge_54_LNL.yxmd

A_Twa

This one was a tough one for me - RegEx is not something I'm very proficient at (yet!).

challenge_54_start_file_AJT.yxmd

MsBindy

I parsed using the named groups with words Circle, Street, Road, etc. For some reason had huge problems with Ave, Avenue, and Road...but eventually through trial and error got those words to work.

challenge_54_MsBindy.yxmd

nick_ceneviva

Solution attached. Couldn't seem to get the RegEx formula right to handle the street information, so I used a text input tool to replace the street (road, ave, etc) with a "|". Then used RegEx match in the filter tool to handle the records with zip codes differently from the ones without.

challenge_54_Ceneviva.yxmd

Elena_Caric

My Solution

challenge_54_start_file.yxmd

Natasha

Here is my solution. City parsing works for these records, but I think it might be unreliable for other possible address combinations.

Spoiler

challenge_54_NK.yxmd

jamielaird

Here's my solution.

Spoiler

challenge_54_JL.yxmd

patrick_digan

Spoiler

I ended up using the Invisio Geocoder (which runs through Google) then the reverse geocoder (Alteryx macro) and then the parse address tool.

challenge_54_start_file.yxzp

marcreid

Solution

Spoiler

challenge_54_MR.yxmd

dominiklz

Spoiler

RegEx: (^\d+) (\w+ \w+|\w+ \w+ \w+) (\w+|\w+ \w+) (\w{2})( \d{5}$|$)

took a lot of testing, but managed to get all information parsed out correctly using just one RegEx tool
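
To try the expression outside Alteryx, here it is in Python's `re` module. The sample addresses in the usage note are made up, the output field names are chosen for this sketch, and Alteryx's regex engine may not behave identically in every case.

```python
import re

# The expression from the post, unchanged.
PATTERN = re.compile(
    r"(^\d+) (\w+ \w+|\w+ \w+ \w+) (\w+|\w+ \w+) (\w{2})( \d{5}$|$)"
)


def parse_line(line):
    m = PATTERN.match(line)
    if m is None:
        return None
    number, street, city, state, zip_code = m.groups()
    return {
        "Street": f"{number} {street}",
        "City": city,
        "State": state,
        # Group 5 is " 12345" or an empty string; normalise it.
        "ZIP": zip_code.strip() or None,
    }
```

For example, `parse_line("45 Oak Drive Fort Collins CO")` yields the city "Fort Collins" with no ZIP. The pattern counts words rather than recognising street types, so it depends on the shapes present in the data: in this Python port, a hypothetical row like "12 Old Mill Road Boulder CO" comes out with the city "Road Boulder".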

challenge_54_Dominik.yxmd

samN

Simple Parsing.

challenge_54.yxmd

SGolnik

I chose this one specifically to practice my regex. I didn't get the zip code the same way, but still arrived at the same result.

challenge_54_mysolution.yxmd

ggruccio

Parsed it a ton!

challenge_54_gg_finish.yxmd

LandonG

Solution attached.

challenge_54_start_file.yxmd

philipmannering

Here's my solution. The regex is necessarily clumsy to parse out the street name.

Spoiler

challenge_54 - Data Prep with Address Parsing.yxmd

jasperlch

Solution attached.

challenge_54_JL.yxmd

paul_houghton

Got myself tied up on how to identify the street dynamically, in the end just did it manually.

Spoiler

challenge_54.yxzp

blairmbailey

Solution attached - thank you!

challenge_54_BB.yxmd

Suzanne

this solution isn't the most efficient but it works -:)

challenge_54_Suzanne_file.yxmd

msicak

If I can have a late Christmas present, I want Regex skillz. The solution works but is by no means as flexible as some of the others here.

Spoiler

challenge_54_MS.yxmd

BrendaS

I didn't like the manual entry portion, but I couldn't figure out any good way around it.

challenge_54_Brendas solution.yxmd

derekbelyea

Spoiler

challenge_54_JDB_AnalytixStudio.yxmd

JosephSerpis

Challenge Completed

Challenge_54_Joe_Serpis.yxmd

TeePee

Spoiler

I found this one very tricky. In the end, I had to use a bit of a hack by manually adding a couple of the cities to a master list of cities which I took from the census bureau website. I've also added a double-check: if the data set increases and there are more cities which aren't present in the census bureau list, they will be marked as "unknown" so the Alteryx user can add the cities to the master list. Not very elegant, but it works.
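
The master-list lookup with an "unknown" fallback can be sketched in Python like this. The city set is a stand-in for the Census Bureau list plus manual additions, and the three-word limit and the name `find_city` are assumptions made for the example.

```python
# Stand-in for the master list of cities (Census Bureau list plus manual additions).
MASTER_CITIES = {"Denver", "Fort Collins", "New York"}


def find_city(text_before_state, max_words=3):
    """Match the trailing words of the address against the master list."""
    words = text_before_state.split()
    # Try the longest trailing run first so "Fort Collins" wins over "Collins".
    for n in range(min(max_words, len(words)), 0, -1):
        candidate = " ".join(words[-n:])
        if candidate in MASTER_CITIES:
            return candidate
    # Double-check: flag cities missing from the list so they can be added.
    return "unknown"
```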

challenge_54_start_file_TP.yxmd

dsmdavid

Regex + Data Cleansing. For a larger dataset a lookup for streets, alleys, circles, etc. would be needed --for this one I thought it was ok to hardcode the couple present.

Spoiler

solved#54.yxmd

ewelch531

I decided to parse out the street address too, so I could learn a little more regex.

challenge_54_ElizabethBillingsWelch.yxmd

pasccout

Here is my solution... based it on the list of cities and counties from the following web site...

https://simplemaps.com/data/us-cities

challenge_54_my_solution.yxmd

AntonioGonzales

Dear @mceleavey,

Hope you are very well.

I just want to thank you a lot for this workflow!

My mind is bubbling with it. :-)

I got a Google API key (my first time using Google APIs) and I'm testing your workflow and finally, I am starting to understand when I have to use '?' in my regex.