Hi Alteryx Hive Mind
I have a challenge for which I have yet to find a robust solution.
I have a ‘Person’ table a million-plus rows long and some 30 attributes/columns wide, with many duplicated users in it.
I am trying to group duplicate users together ready to be migrated to a new system. However, the data quality is not ideal (when is it!), and, for a given person, attributes may or may not be populated, or may even differ between rows (a classic data governance/quality issue!).
Take the (dummy) sample data below. It is easy for me as a human to see that the first four rows are the same person, despite the title and phone differing and the email missing on row 1, because there is enough information to infer a relationship (e.g. the phone is the same for rows 1 & 2 and the email is the same for rows 2, 3 & 4).
I have tried several approaches, but all without joy: each solution half worked, none fully worked, and none scale to the 30-some attributes that need checking.
# Generating hash keys for column combinations, then aggregating (count/group-by) on the hash keys and comparing the counts.
# Using the Fuzzy Match & Make Group tools (both listing columns individually and using one long concatenated column), but I couldn't get them to work fully even on this small dummy dataset.
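For context, here is a minimal Python sketch of the hash-key idea from approach 1 (column names are from the sample below; `min_cols` is an assumed threshold I've added for illustration). It also shows why the approach blows up at 30 columns:

```python
from itertools import combinations
import hashlib

cols = ["fname", "sname", "title", "dob", "email", "phone"]
row = {"fname": "john", "sname": "brown", "title": "Mr",
       "dob": "02/02/1972", "email": "", "phone": "11111"}

def combo_keys(row, cols, min_cols=2):
    """One hash key per combination of populated columns in a row."""
    keys = []
    populated = [c for c in cols if row[c]]
    for r in range(min_cols, len(populated) + 1):
        for combo in combinations(populated, r):
            raw = "|".join(f"{c}={row[c]}" for c in combo)
            keys.append(hashlib.md5(raw.encode()).hexdigest())
    return keys

# 5 populated columns already yield 26 combination keys per row;
# with 30 columns the combination count approaches 2**30 per row,
# which is why this approach doesn't scale.
print(len(combo_keys(row, cols)))  # → 26
```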
Expected grouping: (1, 2, 3 & 4) (5, 6 & 7) (8) (9)
| OBJECT_ID | Fname | Sname | Title | DOB | Email | phone |
| --- | --- | --- | --- | --- | --- | --- |
| 1 | john | brown | Mr | 02/02/1972 | | 11111 |
| 2 | john | brown | Mr | 02/02/1972 | ghjkghjk_AT_hotmail.com | 11111 |
| 3 | john | brown | Capt | 02/02/1972 | ghjkghjk_AT_hotmail.com | 99999 |
| 4 | john | brown | | 02/02/1972 | ghjkghjk_AT_hotmail.com | |
| | | | | | | |
| 5 | john | brown | Mr | 20/01/2020 | qwertyui_AT_gmail.com | 55555 |
| 6 | john | brown | Mr | | qwertyui_AT_gmail.com | |
| 7 | john | brown | | 20/01/2020 | | 55555 |
| | | | | | | |
| 8 | john | brown | Mr | | | |
| | | | | | | |
| 9 | Julie | brown | Mrs | 25/09/1950 | abcdefghijk_AT_truemail.com | 88888 |
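To show the transitive logic I'm describing (rows 1 & 2 share a phone, rows 2, 3 & 4 share an email, so all four link together), here is a hedged Python sketch using union-find over shared non-blank email/phone values. This is only an illustration of the connected-components idea (it's effectively what I hoped Make Group would do on match pairs), not a production matcher; on real data, linking on any shared phone alone could over-merge, e.g. household numbers:

```python
from collections import defaultdict

# Dummy rows from the sample above: (OBJECT_ID, email, phone); "" = unknown.
rows = [
    (1, "",                            "11111"),
    (2, "ghjkghjk_AT_hotmail.com",     "11111"),
    (3, "ghjkghjk_AT_hotmail.com",     "99999"),
    (4, "ghjkghjk_AT_hotmail.com",     ""),
    (5, "qwertyui_AT_gmail.com",       "55555"),
    (6, "qwertyui_AT_gmail.com",       ""),
    (7, "",                            "55555"),
    (8, "",                            ""),
    (9, "abcdefghijk_AT_truemail.com", "88888"),
]

parent = list(range(len(rows)))

def find(i):
    """Union-find root lookup with path halving."""
    while parent[i] != i:
        parent[i] = parent[parent[i]]
        i = parent[i]
    return i

def union(a, b):
    parent[find(a)] = find(b)

# Bucket rows by each non-blank identifying value, then union each bucket,
# so matches chain transitively (1-2 via phone, 2-3-4 via email, etc.).
buckets = defaultdict(list)
for i, (_, email, phone) in enumerate(rows):
    if email:
        buckets[("email", email)].append(i)
    if phone:
        buckets[("phone", phone)].append(i)

for members in buckets.values():
    for m in members[1:]:
        union(members[0], m)

groups = defaultdict(list)
for i, (oid, _, _) in enumerate(rows):
    groups[find(i)].append(oid)

print(sorted(sorted(g) for g in groups.values()))
# → [[1, 2, 3, 4], [5, 6, 7], [8], [9]]
```

On the sample this reproduces the expected grouping exactly, including row 8 falling out as its own group because it shares no non-blank identifier with anything else. Scaling-wise, this is linear-ish in rows (one bucket entry per populated key column), rather than combinatorial in columns.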
So, how would you tackle this? Any solution or approach that would scale to the number of rows and columns involved would be much appreciated.
Thanks
P