So I've been given a pretty unorganized data set with a bunch of address info all in one column. I figured out how to parse out zip and state using ReverseString then a Text to Column. I having a much harder time getting the City and Address because as you can see the info is all over the place some having customer info and some having PO Boxes.
I'm wondering if there is anyway I can parse out "city" by looking for patterns like parsing everything left of an two uppercase letters and right of everyword like "AVE, ST, LN, RD etc." Possible similar thing for address with parsing between "RD, ST etc" and a set of digits.
Not sure if thats possible but my goal was to eventually use the Alteryx geocoding app. Thanks for the help!
Solved! Go to Solution.
Parsing your address data is an art. you might want to use regular expressions in the logic as they perform faster and require fewer lines of code.
You might want to look for patterns of data (includes PO or BOX) and have alternate sets of logic for those streams of data.
I'd try coding for 80% of the data to look wonderful (this should come quickly) and see if the remaining 20% can be corrected without over-engineering a solution.
Just some random thoughts for you...
Mark
Any tips for what those expressions might look like within Regex when parsing left or right of things like PO BOX or set of digits? I don't have much expeience yet with Regex. Thanks though!
ok. here are some 'quick' examples how to get to the zip, state and po box:
ZIP: regex_replace([text],"(.*)(\d{5}$|\d{5}-\d{4}$)","$2") STATE: regex_replace(text,"(.*\s)([[:alpha:]]{2})(\s\d{5}.*)","$2") BOX: regex_replace(uppercase(text),"(.*)(PO\sBOX\s\d{1,5}|P.O.\sBOX\s\d{1,5}|POB\s\d{1,5})(.*)","$2")
i've saved a workflow in version 10.5. If you are running version 10.1, you'll need to open it with NOTEPAD and edit the XML header from 10.5 to 10.0.
the regular expressions look for 2 versions of ZIP (99999 or 99999-9999) followed by an end of field character. then they look for SPACE followed by 2 letters followed by a a space and 5 numbers as the state. the 3rd formula looks for a variety of "PO Box" followed by 1 to 5 digits followed by other stuff.
Hope that this helps. you can google REGEX or REGULAR Expressions and find lots of help. I also use the REGEX COACH (you'll find a link in alteryx help) to construct and test expressions.
Cheers,
Mark
Here's another post that might have some ideas...
Thanks! that works well when the PO box is present but since not many of the addresses have PO boxes do you know of a tool that could nullifies the cell when a word/words isn't in the data?