Hi Alteryx Hive Mind
I have a challenge for which I have yet to find a robust solution.
I have a ‘Person’ table a million-plus rows long and some 30 attributes/columns wide, with many duplicated users in it.
I am trying to group duplicate users together ready to be migrated to a new system. However, the data quality is not ideal (when is it!), and, for a given person, attributes may or may not be populated, or may even differ between rows (a classic data governance/quality issue!).
Take the (dummy) sample data below. It is easy for me as a human to see that the first four rows are the same person, despite the title and phone differing and the email missing on row 1, because there is enough information to infer a relationship (e.g. the phone is the same for rows 1 & 2 and the email is the same for rows 2, 3 & 4).
I have tried several approaches, but all without joy: each solution half worked, none fully worked, and none scale to the 30-some attributes that need checking.
# Generating hash keys for column combinations, then aggregating (count/group-by) on the hash keys and comparing the counts.
# Using the Fuzzy Match & Make Group tools (both listing columns individually and using one long concatenated column), but I couldn't get them to work fully even on this small dummy dataset.
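For context, here is a minimal Python sketch of the hash-key idea from approach 1 (column names are from the sample below; `min_cols` is an assumed threshold I've added for illustration). It also shows why the approach blows up at 30 columns:

```python
from itertools import combinations
import hashlib

cols = ["fname", "sname", "title", "dob", "email", "phone"]
row = {"fname": "john", "sname": "brown", "title": "Mr",
       "dob": "02/02/1972", "email": "", "phone": "11111"}

def combo_keys(row, cols, min_cols=2):
    """One hash key per combination of populated columns in a row."""
    keys = []
    populated = [c for c in cols if row[c]]
    for r in range(min_cols, len(populated) + 1):
        for combo in combinations(populated, r):
            raw = "|".join(f"{c}={row[c]}" for c in combo)
            keys.append(hashlib.md5(raw.encode()).hexdigest())
    return keys

# 5 populated columns already yield 26 combination keys per row;
# with 30 columns the combination count approaches 2**30 per row,
# which is why this approach doesn't scale.
print(len(combo_keys(row, cols)))  # → 26
```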
Expected grouping: (1, 2, 3 & 4) (5, 6 & 7) (8) (9)
| OBJECT_ID | Fname | Sname | Title | DOB | Email | phone |
| --- | --- | --- | --- | --- | --- | --- |
| 1 | john | brown | Mr | 02/02/1972 | | 11111 |
| 2 | john | brown | Mr | 02/02/1972 | ghjkghjk_AT_hotmail.com | 11111 |
| 3 | john | brown | Capt | 02/02/1972 | ghjkghjk_AT_hotmail.com | 99999 |
| 4 | john | brown | | 02/02/1972 | ghjkghjk_AT_hotmail.com | |
| | | | | | | |
| 5 | john | brown | Mr | 20/01/2020 | qwertyui_AT_gmail.com | 55555 |
| 6 | john | brown | Mr | | qwertyui_AT_gmail.com | |
| 7 | john | brown | | 20/01/2020 | | 55555 |
| | | | | | | |
| 8 | john | brown | Mr | | | |
| | | | | | | |
| 9 | Julie | brown | Mrs | 25/09/1950 | abcdefghijk_AT_truemail.com | 88888 |
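To show the transitive logic I'm describing (rows 1 & 2 share a phone, rows 2, 3 & 4 share an email, so all four link together), here is a hedged Python sketch using union-find over shared non-blank email/phone values. This is only an illustration of the connected-components idea (it's effectively what I hoped Make Group would do on match pairs), not a production matcher; on real data, linking on any shared phone alone could over-merge, e.g. household numbers:

```python
from collections import defaultdict

# Dummy rows from the sample above: (OBJECT_ID, email, phone); "" = unknown.
rows = [
    (1, "",                            "11111"),
    (2, "ghjkghjk_AT_hotmail.com",     "11111"),
    (3, "ghjkghjk_AT_hotmail.com",     "99999"),
    (4, "ghjkghjk_AT_hotmail.com",     ""),
    (5, "qwertyui_AT_gmail.com",       "55555"),
    (6, "qwertyui_AT_gmail.com",       ""),
    (7, "",                            "55555"),
    (8, "",                            ""),
    (9, "abcdefghijk_AT_truemail.com", "88888"),
]

parent = list(range(len(rows)))

def find(i):
    """Union-find root lookup with path halving."""
    while parent[i] != i:
        parent[i] = parent[parent[i]]
        i = parent[i]
    return i

def union(a, b):
    parent[find(a)] = find(b)

# Bucket rows by each non-blank identifying value, then union each bucket,
# so matches chain transitively (1-2 via phone, 2-3-4 via email, etc.).
buckets = defaultdict(list)
for i, (_, email, phone) in enumerate(rows):
    if email:
        buckets[("email", email)].append(i)
    if phone:
        buckets[("phone", phone)].append(i)

for members in buckets.values():
    for m in members[1:]:
        union(members[0], m)

groups = defaultdict(list)
for i, (oid, _, _) in enumerate(rows):
    groups[find(i)].append(oid)

print(sorted(sorted(g) for g in groups.values()))
# → [[1, 2, 3, 4], [5, 6, 7], [8], [9]]
```

On the sample this reproduces the expected grouping exactly, including row 8 falling out as its own group because it shares no non-blank identifier with anything else. Scaling-wise, this is linear-ish in rows (one bucket entry per populated key column), rather than combinatorial in columns.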
So, how would you tackle this? Any solution or approach that would scale to the number of rows and columns involved would be much appreciated.
Thanks
P