Hi Pro User,
I have a list of cities with names that need to be corrected - extraneous characters after names, wrong spelling, etc. - based on a list the Lookup_Table. Below is a file with two tabs - the Data tab that shows the different stages of cleansing and the Lookup_Table tab that provides the correct info.
In the Data tab, Step 0 is a list of data (a-c city names). Step 1 peels out those items that have the 5 digit zip code. Step 2 removes everything starting with a character after a name. Step 3 looks up the Lookup_Table to find the correct spelling of the city names. Step 4 makes an additional correction for those with incorrect city names.
Here are the steps:
STEP 0 | STEP 1 | STEP 2 | STEP 3 | STEP 4 | ||||
ORIG DATASET | REMOVE ZIPCODE | POSTAL | REMOVE CHARS | POSTAL | TABLE LOOKUP | POSTAL | CORRECTION | POSTAL |
***WILL CALL FOR JOBSITE | ***WILL CALL FOR JOBSITE | TBD | ||||||
0000 | 0000 | 0000 | #N/A | TBD | ||||
29 PALMS, CA | 29 PALMS, CA | 29 PALMS | #N/A | TBD | ||||
93664 | 93664 | 93664 | 93664 | 93664 | #N/A | 93664 | TBD | 93664 |
ADELANTO | ADELANTO | ADELANTO | ADELANTO | ADELANTO | ||||
ADELANTO, CA | ADELANTO, CA | ADELANTO | ADELANTO | ADELANTO | ||||
AGOURA | AGOURA | AGOURA | #N/A | AGOURA HILLS | ||||
AGOURA HILLS | AGOURA HILLS | AGOURA HILLS | AGOURA HILLS | AGOURA HILLS | ||||
AGOURA HILLS, CALIFORNIA | AGOURA HILLS, CALIFORNIA | AGOURA HILLS | AGOURA HILLS | AGOURA HILLS | ||||
AGUANA | AGUANA | AGUANA | #N/A | TBD | ||||
AHAMBRA | AHAMBRA | AHAMBRA | #N/A | ALHAMBRA | ||||
ALHAMBRA | ALHAMBRA | ALHAMBRA | ALHAMBRA | ALHAMBRA | ||||
ALHAMBRA CA | ALHAMBRA CA | ALHAMBRA | ALHAMBRA | ALHAMBRA | ||||
ALHAMBRA CA. | ALHAMBRA CA. | ALHAMBRA | ALHAMBRA | ALHAMBRA | ||||
ALHAMBRA, CA | ALHAMBRA, CA | ALHAMBRA | ALHAMBRA | ALHAMBRA | ||||
ALHAMBRA, CA 91803 | ALHAMBRA, CA 91803 | 91803 | ALHAMBRA | 91803 | ALHAMBRA | 91803 | ALHAMBRA | 91803 |
ALHAMBRA, CA 91806 | ALHAMBRA, CA 91806 | 91806 | ALHAMBRA | 91806 | ALHAMBRA | 91806 | ALHAMBRA | 91806 |
ALHAMBRA, CA. | ALHAMBRA, CA. | ALHAMBRA | ALHAMBRA | ALHAMBRA | ||||
ALHAMBRA, CALIFORNIA | ALHAMBRA, CALIFORNIA | ALHAMBRA | ALHAMBRA | ALHAMBRA | ||||
ALHAMBRA,CA | ALHAMBRA,CA | ALHAMBRA | ALHAMBRA | ALHAMBRA | ||||
ALISO VIEJO, CA | ALISO VIEJO, CA | ALISO VIEJO | ALISO VIEJO | ALHAMBRA | ||||
ALTADENA | ALTADENA | ALTADENA | ALTADENA | ALTADENA | ||||
ANAHEIM | ANAHEIM | ANAHEIM | ANAHEIM | ANAHEIM | ||||
ANAHEIM, CA | ANAHEIM, CA | ANAHEIM | ANAHEIM | ANAHEIM | ||||
ANYWHERE, CA | ANYWHERE, CA | ANYWHERE | #N/A | TBD | ||||
ANZA | ANZA | ANZA | ANZA | ANZA | ||||
APPLE VALLEY | APPLE VALLEY | APPLE VALLEY | APPLE VALLEY | APPLE VALLEY | ||||
APPLE VALLEY, CA | APPLE VALLEY, CA | APPLE VALLEY | APPLE VALLEY | APPLE VALLEY | ||||
ARCADIA | ARCADIA | ARCADIA | ARCADIA | ARCADIA | ||||
AS REQUESTED | AS REQUESTED | AS REQUESTED | #N/A | TBD | ||||
AUBERRY | AUBERRY | AUBERRY | AUBERRY | AUBERRY | ||||
AVALON | AVALON | AVALON | AVALON | AVALON | ||||
AVALON, CA | AVALON, CA | AVALON | AVALON | AVALON | ||||
AVALON, SANTA CATALINA IS | AVALON, SANTA CATALINA IS | AVALON | AVALON | AVALON | ||||
AZUSA | AZUSA | AZUSA | AZUSA | AZUSA | ||||
BAKERSFEILD | BAKERSFEILD | BAKERSFEILD | #N/A | BAKERSFIELD | ||||
BAKERSFIELD | BAKERSFIELD | BAKERSFIELD | BAKERSFIELD | BAKERSFIELD | ||||
BAKERSFIELD, CA | BAKERSFIELD, CA | BAKERSFIELD | BAKERSFIELD | BAKERSFIELD | ||||
BAKERSFIELD, CA. | BAKERSFIELD, CA. | BAKERSFIELD | BAKERSFIELD | BAKERSFIELD | ||||
BAKERSFIELS, CA | BAKERSFIELS, CA | BAKERSFIELS | #N/A | BAKERSFIELD | ||||
BAKERSFILED | BAKERSFILED | BAKERSFILED | #N/A | BAKERSFIELD |
I think I can do steps 1-3, but then, I'm not clear on how to approach steps 3-4. Could someone please guide me on how to approach this problem? Appreciate it.
Thanks,
kwl
Solved! Go to Solution.
@binuacs What's the difference between \d{5,}+ and \d{5}+? Or \d{5},+? Fairly new to RegEx. Thank you!
you can play around with the regex on the site https://regex101.com/
\d{5} - In this case, the regex will look for exactly 5 digits, suppose the zip number is more than 5 in length it will not consider those zip
Trim(REGEX_Replace([ORIG DATASET], '\d{5,}', '')) - replaces the digits, one of your requirement was to remove the zip codes from the ORIG DATASET field. so I used the REGEX_Replace function , the above regex will replace any digits which are 5 or more in length
What's the purpose of using the Sample Tool above? Is it for testing? Also, how do I peel out the leftover that did not pass the 80% threshold in the Fuzzy Match Tool?
@knnwndlm If you take a look at the output of the sort tool you can find so many duplicate records. please see below
so I used the sample tool to pick the first record of each record id and excluded the duplicate records
The below join will give you the left over part
I have a zip code with numbers that are clearly not zip code, i.e., 4 digits, 8 digits. I only want to pick up those items with 5 digits and exclude all the others. How do I do this in RegEx? If I use \d{5}, it's gonna up the first five digits. I want everything that has more or less than 5 digits to be excluded.
Any suggestions?
Thanks,
K
@knnwndlm to pick the 5 char zip code you dont need to use the regex, you can simply filter the zip code based on the length. Attaching a sample