Hi RegEx Pro,
I'm trying to build a RegEx to tease out the name of the city without the extraneous info at the end. For example, here's my table:
HESPERIA CA
HESPERIA, CA
HESPERIA. CA
HESPERIA,, CA,
BOULDER& NV
BOULDER&NV
What I want to capture is HESPERIA and BOULDER leaving out the characters. The way I can tell that it's the end of the city name is that we have char(s) and/or space(s) before CA or NV. I've been doing this the long way by using RegEx and Text To Columns with one character at a time. I'm trying to figure out a way to factor in other miscellaneous characters before CA or NV that I might not be aware of.
Could somebody help me figuring out an expression in RegEx that will encapsulates all the miscellaneous characters before CA or NV and leaves the city name untouched?
Thanks,
K
Solved! Go to Solution.
How do you get on with the following, @knnwndlm? You’d need to be using the RegEx tool in Parse mode:
(\w+).+
Hi @knnwndlm,
Another item you might try is cleaning all punctuation with a Data Cleansing tool, then trimming off the two characters on the right of your original value. By doing it this way there is no Regex required, and you can pull cities with a space in the name cleanly as well. Please see the below workflow for an example.
Best,
Hi @DataNath ,
For New York, (\w+).+ only picks up New and left the York out. Also, I'm not sure I understand the entire expression there. I'm just foraying into RegEx. Could you please help me understand the expression?
Thanks,
K
Thanks @ConnorK!
If I were to do this using RegEx, what's the best approach to cover all the punctuations? I'm trying to learn RegEx and that's why I want to use it in my workflow.
Thanks @DataNath!
This works! Could you please help me understand the expression as I'm still learning it?
No problem @knnwndlm, happy to help! The current expression breaks down as follows:
(.+\w) < Captures all characters up until the last occurrence of a word character (denoted by \w). The reason it doesn't continue on until the other letters i.e. NY/CA/NV is because after the capture group i.e. everything bracketed, we state:
.+ < One or more of any character (the . is a wildcard for any character other than new line, the + means 'one or more'), followed by:
\w{2} < Two (made explicit by the {2}) word characters - in this case the state abbreviations.
We could've also used something like (.+\w).+[A-Z]{2} to be more specific to 2 capital letters but for a use like this I thought \w would be fine. Whilst picking up RegEx, and even for referring back to often when getting the hang of it, cheat sheets like so are always great:
https://cheatography.com/davechild/cheat-sheets/regular-expressions/
I always tend to test my expressions before running them in Alteryx as well, with my site of preference being https://regex101.com/
Thanks @DataNath!
Hi @DataNath,
I noticed that some city names were chopped apart based on (.+\w).+\w{2}. Here are a few examples:
ONTARIO becomes ONTA
COMPTON becomes COMP
TULARE becomes TUL
RANCHO CUCAMONGA becomes RANCHO CUCAMO
MONTEREY PARK becomes MONTERY P
Guessed I forgot to include the case where there's no CA/NV/NY. How do tell RegEx to ignore such cases?
Thanks,
K