Hey world!
I am working on a massive data de-duplication project, working with a lot of legacy data on multiple contact records. I've written this process, which pairs records by email address, marks the "master" record as the one more recently edited, and overwrites any last name shorter than two characters with the other last name in the pair. (IE Cheshire C. becomes Cheshire Cat).
I'd like to split this into two outputs. Output 1 should be any record where neither record nor its pair have had their name overwritten. Output 2 should be any record or its pair have had a name change.
How would I determine this, and how would I put both halves of a pair into the same output record?
Current:
Level | First | Last | |
alice@wonderland.com | Master | Alice | Carroll |
alice@wonderland.com | Victim | Alice | Carroll |
ccat@wonderland.com | Victim | Cheshire | Cat |
ccat@wonderland.com | Master | Cheshire | C |
Goal:
Level | First | Last | SaveTo | |
alice@wonderland.com | Master | Alice | Carroll | File1 |
alice@wonderland.com | Victim | Alice | Carroll | File1 |
ccat@wonderland.com | Victim | Cheshire | Cat | File2 |
ccat@wonderland.com | Master | Cheshire | C | File2 |
Solved! Go to Solution.
Hello @agriese
In the attached example, I used a Tile tool to determine email records that had more than one First/Last Name combo. Once I joined that information back to all records of each email, I filtered them into the two output files. Let me know if you have any questions.
Hey @CharlieS, thanks for that example! I believe it works, but I'm not understanding the thought process that led you here. Would you mind explaining how it works, so I can troubleshoot on my own data?
Sure thing.
- The most important tool here is the Tile tool. I've set it to find unique values of [Email], [First], and [Last] and set it to group by the [Email] field. For each email address, this will determine the number of unique combinations of first and last names for that email. In the version that I send you, selecting [Email] as a unique field is redundant since it is set as a group by field.
- The Summarize tool will group by [Email] and returns the number of unique First/Last name combos found for that email.
- The number of unique name records for each email is then rejoined and filtered so emails that all have the same name go to File1 and emails with different name combos go to File2.
Thanks! I'll have to mess around with this Tile tool more going forward.