Here's an example of what my data looks like (there are more fields but I only included the relevant ones):
First Name | Last Name | Primary Advisor | Gender | Home Address 1 | Home City | Home Phone | Home State | Home Zip | Mobile Phone |
Joan | Smith | Jones, Jimmie | Po Box 01332 | Fort Myers | +1 (239) 562-2834 | FL | 33919 | ||
Joan | Smith | Female | PO Box 01332 | Ft. Myers | +1 (239) 184-9749 | FL | 33919 | +1 (239) 489-4700 |
Ideally, I'd want the output to be something like the following:
First Name | Last Name | Primary Advisor | Gender | Home Address 1 | Home City | Home Phone | Home State | Home Zip | Mobile Phone |
Joan | Smith | Jones, Jimmie | Female | PO Box 01332 | Fort Myers | +1 (239) 562-2834 | FL | 33919 | +1 (239) 489-4700 |
Basically, fill in the blanks/missing fields. If there's a conflict (both rows have data for the same field), I'm okay with just keeping the data from the first row, at least for that field.
Another issue is there is no guarantee of first + last name being unique. For example, these are the other records I have for "Joan Smith", which appears to be at least 7 unique people:
First Name | Last Name | Primary Advisor | Gender | Home Address 1 | Home City | Home Phone | Home State | Home Zip | Mobile Phone |
Joan | Smith | Bob Studer | 1856 Flamingo Place. Apt. #3 | Miami Beach | +1 305-698-0354 | FL | 33140 | ||
Joan | Smith | Steve Gilbert | |||||||
Joan | Smith | Jones, Jimmie | Po Box 01332 | Fort Myers | +1 (239) 562-2834 | FL | 33919 | ||
Joan | Smith | Female | PO Box 01332 | Ft. Myers | +1 (239) 184-9749 | FL | 33919 | +1 (239) 489-4700 | |
Joan | Smith | 1315 S Clewis Ct Apt 101 | Tampa | +1 8132365385 | FL | 33629 | |||
Joan | SMITH | Jack Henry | 230 ROUTE 579 | HAMPTON | +1 906-854-7400 | NJ | 08827-4345 | ||
Joan | Smith | Porpora Paul A. | 12 Lakeside Road | Mahopac | +1 (914) 369-4336 | NY | 10541 | ||
Joan A | Smith | One Metrotech Center | Brooklyn | NY | 11201 |
So my question is, how would I go about doing this? I've been looking into fuzzy matching but I'm not sure it'll solve my problem completely. I'm open to any suggestions, thank you.
Solved! Go to Solution.
Hi @dcoelho,
I think you can achieve this with a combination of a Fuzzy Match tool and a Formula tool. I have copied your table into the input data tool in the workflow attached.
The Fuzzy Match tool is setup in this example to look at Home Address (this can be amended to what you would prefer the fields to match on). It is also configured to 'Output unmatched records'. This means in the output of the tool you can see 6 records in two columns;
You then can join the data from the record IDs back in, so that you have your original data set, next to a copy of the data set from anything that has matched.
Using the Formula tool you can then add in IF statements to replace null values with anything that is populated from the matched Record ID (I have only added in Gender and Mobile Phone for this example)
I hope this helps. Let me know if you have any questions.
Regards
Will
@wdavis, thank you for your reply. It shows me how little I know about Alteryx because that workflow is more complex than anything I've tried for this solution (and it's not even that complicated of a workflow).
Unfortunately I'm not sure it's working for my full data set. Running my data through this workflow brought it from 153,000 records to a little under 12,000. At first I thought maybe I do have that many duplicates. However, running the original data through the Unique tool and selecting First Name, Last Name, and Address as the unique fields I got 137,000 unique records matching that criteria.
Looking through the output, it appears that it actually deleted all 8 Joan Smith records.
Hi @dcoelho,
It might be worth having a look at some of our 'Tool Mastery' guides available on community, which are really useful when starting to use new tools, especially those outside of the Favourites toolbar. This guide on Fuzzy Matching should hopefully help with your understanding - https://community.alteryx.com/t5/Alteryx-Knowledge-Base/Tool-Mastery-Fuzzy-Match/ta-p/45485
The Unique tool is likely to be pulling through more unique rows that are in fact not unique and could have differences due to slight spelling changes, e.g Joan Smith and joan smith would both be considered Unique rows within the tool due to the case differences. Likewise, if any of the rows are missing data for one of the 3 fields, then it will identify these as unique values.
Which fields are you currently using in your fuzzy match tool?
Regards
Will
Will,
I ended up going a different route to solve this issue, but I appreciate your assistance regardless. Thank you for your help!
Hi dcoelho,
Do you mind sharing what route you ended up taking to solve this issue? I have the exact same issue you have illustrated above and would appreciate any insight! Thank you.