Alteryx Designer Desktop Discussions

Find answers, ask questions, and share expertise about Alteryx Designer Desktop and Intelligence Suite.
SOLVED

Parse/Fuzzy Match

knnwndlm
8 - Asteroid

Hi Pro User,

 

I have a list of cities with names that need to be corrected - extraneous characters after names, wrong spelling, etc. - based on a list the Lookup_Table.  Below is a file with two tabs - the Data tab that shows the different stages of cleansing and the Lookup_Table tab that provides the correct info.

 

In the Data tab, Step 0 is a list of data (a-c city names). Step 1 peels out those items that have the 5 digit zip code.  Step 2 removes everything starting with a character after a name.  Step 3 looks up the Lookup_Table to find the correct spelling of the city names.  Step 4 makes an additional correction for those with incorrect city names.

 

Here are the steps:

 

STEP 0STEP 1 STEP 2 STEP 3 STEP 4 
ORIG DATASETREMOVE ZIPCODEPOSTALREMOVE CHARSPOSTALTABLE LOOKUP POSTALCORRECTIONPOSTAL
***WILL CALL FOR JOBSITE***WILL CALL FOR JOBSITE     TBD 
00000000 0000 #N/A TBD 
29 PALMS, CA29 PALMS, CA 29 PALMS #N/A TBD 
9366493664936649366493664#N/A93664TBD93664
ADELANTOADELANTO ADELANTO ADELANTO ADELANTO 
ADELANTO, CAADELANTO, CA ADELANTO ADELANTO ADELANTO 
AGOURAAGOURA AGOURA #N/A AGOURA HILLS 
AGOURA HILLSAGOURA HILLS AGOURA HILLS AGOURA HILLS AGOURA HILLS 
AGOURA HILLS, CALIFORNIAAGOURA HILLS, CALIFORNIA AGOURA HILLS AGOURA HILLS AGOURA HILLS 
AGUANAAGUANA AGUANA #N/A TBD 
AHAMBRAAHAMBRA AHAMBRA #N/A ALHAMBRA 
ALHAMBRAALHAMBRA ALHAMBRA ALHAMBRA ALHAMBRA 
ALHAMBRA CAALHAMBRA CA ALHAMBRA ALHAMBRA ALHAMBRA 
ALHAMBRA CA.ALHAMBRA CA. ALHAMBRA ALHAMBRA ALHAMBRA 
ALHAMBRA, CAALHAMBRA, CA ALHAMBRA ALHAMBRA ALHAMBRA 
ALHAMBRA, CA 91803ALHAMBRA, CA 9180391803ALHAMBRA91803ALHAMBRA91803ALHAMBRA91803
ALHAMBRA, CA 91806ALHAMBRA, CA 9180691806ALHAMBRA91806ALHAMBRA91806ALHAMBRA91806
ALHAMBRA, CA.ALHAMBRA, CA. ALHAMBRA ALHAMBRA ALHAMBRA 
ALHAMBRA, CALIFORNIAALHAMBRA, CALIFORNIA ALHAMBRA ALHAMBRA ALHAMBRA 
ALHAMBRA,CAALHAMBRA,CA ALHAMBRA ALHAMBRA ALHAMBRA 
ALISO VIEJO, CAALISO VIEJO, CA ALISO VIEJO ALISO VIEJO ALHAMBRA 
ALTADENAALTADENA ALTADENA ALTADENA ALTADENA 
ANAHEIMANAHEIM ANAHEIM ANAHEIM ANAHEIM 
ANAHEIM, CAANAHEIM, CA ANAHEIM ANAHEIM ANAHEIM 
ANYWHERE, CAANYWHERE, CA ANYWHERE #N/A TBD 
ANZAANZA ANZA ANZA ANZA 
APPLE VALLEYAPPLE VALLEY APPLE VALLEY APPLE VALLEY APPLE VALLEY 
APPLE VALLEY, CAAPPLE VALLEY, CA APPLE VALLEY APPLE VALLEY APPLE VALLEY 
ARCADIAARCADIA ARCADIA ARCADIA ARCADIA 
AS REQUESTEDAS REQUESTED AS REQUESTED #N/A TBD 
AUBERRYAUBERRY AUBERRY AUBERRY AUBERRY 
AVALONAVALON AVALON AVALON AVALON 
AVALON, CAAVALON, CA AVALON AVALON AVALON 
AVALON, SANTA CATALINA ISAVALON, SANTA CATALINA IS AVALON AVALON AVALON 
AZUSAAZUSA AZUSA AZUSA AZUSA 
BAKERSFEILDBAKERSFEILD BAKERSFEILD #N/A BAKERSFIELD 
BAKERSFIELDBAKERSFIELD BAKERSFIELD BAKERSFIELD BAKERSFIELD 
BAKERSFIELD, CABAKERSFIELD, CA BAKERSFIELD BAKERSFIELD BAKERSFIELD 
BAKERSFIELD, CA.BAKERSFIELD, CA. BAKERSFIELD BAKERSFIELD BAKERSFIELD 
BAKERSFIELS, CABAKERSFIELS, CA BAKERSFIELS #N/A BAKERSFIELD 
BAKERSFILEDBAKERSFILED BAKERSFILED #N/A BAKERSFIELD 

 

I think I can do steps 1-3, but then, I'm not clear on how to approach steps 3-4.  Could someone please guide me on how to approach this problem?  Appreciate it.

 

Thanks,

kwl

10 REPLIES 10
binuacs
20 - Arcturus

@knnwndlm There might be a better way of doing this

 

binuacs_0-1660603424492.png

 

knnwndlm
8 - Asteroid

@binuacs What's the difference between \d{5,}+ and \d{5}+?  Or \d{5},+?  Fairly new to RegEx.  Thank you!

binuacs
20 - Arcturus

@knnwndlm 

 

\d{5,}+
-----------------
 
\d-matches a digit (equivalent to [0-9])
{5,}+ matches the previous token between 5 and unlimited times, as many times as possible, without giving back (possessive)
 
in the below scenario I need to extract the 5-digit zip code, so I used this pattern \d{5,}

 

binuacs_0-1660729103024.png

 

you can play around with the regex on the site https://regex101.com/ 

 

binuacs_1-1660729464601.png

 

 

 \d{5} - In this case, the regex will look for exactly 5 digits, suppose the zip number is more than 5 in length it will not consider those zip

 

 

Trim(REGEX_Replace([ORIG DATASET], '\d{5,}', '')) - replaces the digits, one of your requirement was to remove the zip codes from the ORIG DATASET field. so I used the REGEX_Replace function , the above regex will replace any digits which are 5 or more in length

binuacs_2-1660729677545.png

 

 

knnwndlm
8 - Asteroid

@binuacs 

 

Thank you!

knnwndlm
8 - Asteroid

@binuacs 

 

What's the purpose of using the Sample Tool above?  Is it for testing?  Also, how do I peel out the leftover that did not pass the 80% threshold in the Fuzzy Match Tool?

binuacs
20 - Arcturus

@knnwndlm If you take a look at the output of the sort tool you can find so many duplicate records. please see below

 

binuacs_1-1660862702815.png

so I used the sample tool to pick the first record of each record id and excluded the duplicate records

 

binuacs_2-1660862844381.png

 

The below join will give you the left over part

binuacs_3-1660863210842.png

 

knnwndlm
8 - Asteroid

@binuacs 

 

Thank you much for clarifying the Sample Tool usage and the model update.  

knnwndlm
8 - Asteroid

@binuacs 

 

I have a zip code with numbers that are clearly not zip code, i.e., 4 digits, 8 digits.  I only want to pick up those items with 5 digits and exclude all the others.  How do I do this in RegEx?  If I use \d{5}, it's gonna up the first five digits.  I want everything that has more or less than 5 digits to be excluded.  

 

Any suggestions?

 

Thanks,

K

binuacs
20 - Arcturus

@knnwndlm to pick the 5 char zip code you dont need to use the regex,  you can simply filter the zip code based on the length. Attaching a sample

 

binuacs_0-1665701115432.png

 

Labels