Alteryx Designer Desktop Discussions

Find answers, ask questions, and share expertise about Alteryx Designer Desktop and Intelligence Suite.

Grouping duplicate users over 30 attributes

PaulSmith
5 - Atom

Hi Alteryx Hive Mind

 

I have a challenge for which I have yet to find a robust solution.

 

I have a ‘Person’ table that is a million-plus rows long and some 30 attributes/columns wide, with many duplicated users in it.

I am trying to group duplicate users together ready to be migrated to a new system. However, the data quality is not ideal (when is it ever!) and, for a given person, attributes may or may not be populated, or may even differ between rows (a classic data governance/quality issue!).

 

Take the (dummy) sample data below: it is easy for me as a human to see that the first 4 rows are the same person, despite the title and phone differing and the email being missing on row 1, because there is enough information to infer a relationship (e.g. the phone is the same for rows 1 & 2 and the email is the same for rows 2, 3 & 4).

I have tried several approaches, but all without joy. Each solution half worked, but none worked fully and none scales to the 30-some attributes that need checking:

 

1. Generating hash keys for column combinations, then aggregating (count group-by) on the hash keys and comparing the counts (see the sketch after this list).

2. Using the Fuzzy Match and Make Group tools (both listing columns individually and using a long concatenated column), but I couldn't get them to work fully even on this small dummy dataset.
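
For clarity, roughly what I mean by the hash-key approach, sketched in Python/pandas purely as illustration (`people.csv` is a stand-in for my real extract and the column names match the dummy data further down):

```python
import hashlib
from itertools import combinations

import pandas as pd

df = pd.read_csv("people.csv")                     # stand-in for the real Person extract
id_cols = ["Fname", "Sname", "DOB", "Email", "Phone"]

# Hash every pair of identifying columns, then count how many rows share each
# hash key - keys seen more than once mark candidate duplicates for that pair.
for cols in combinations(id_cols, 2):
    key = (
        df[list(cols)]
        .fillna("")
        .astype(str)
        .apply("|".join, axis=1)
        .map(lambda s: hashlib.md5(s.encode()).hexdigest())
    )
    counts = key.map(key.value_counts())
    candidates = df[counts > 1]                    # rows sharing this column-pair key
    # caveat: rows where both columns are blank also share a key, which is one
    # reason this only ever half worked for me
```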


Expected groups: (1, 2, 3 & 4), (5, 6 & 7), (8), (9)

 

| OBJECT_ID | Fname | Sname | Title | DOB | Email | Phone |
|---|---|---|---|---|---|---|
| 1 | john | brown | Mr | 02/02/1972 | | 11111 |
| 2 | john | brown | Mr | 02/02/1972 | ghjkghjk_AT_hotmail.com | 11111 |
| 3 | john | brown | Capt | 02/02/1972 | ghjkghjk_AT_hotmail.com | 99999 |
| 4 | john | brown | | 02/02/1972 | ghjkghjk_AT_hotmail.com | |
| 5 | john | brown | Mr | 20/01/2020 | qwertyui_AT_gmail.com | 55555 |
| 6 | john | brown | Mr | | qwertyui_AT_gmail.com | |
| 7 | john | brown | | 20/01/2020 | | 55555 |
| 8 | john | brown | Mr | | | |
| 9 | Julie | brown | Mrs | 25/09/1950 | abcdefghijk_AT_truemail.com | 88888 |
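
To make the linkage rule concrete, here is the grouping behaviour I am after, spelled out in plain Python (illustration only - it ignores fuzzy matching and scale, and links only on the "hard" identifier columns):

```python
# Two rows end up in the same group if they share a value in any linking column,
# either directly or through a chain of other rows (which pulls rows 1-4 together).

records = [
    # OBJECT_ID, Fname, Sname, DOB, Email, Phone - the dummy data from the table above
    {"OBJECT_ID": 1, "Fname": "john",  "Sname": "brown", "DOB": "02/02/1972", "Email": None,                          "Phone": "11111"},
    {"OBJECT_ID": 2, "Fname": "john",  "Sname": "brown", "DOB": "02/02/1972", "Email": "ghjkghjk_AT_hotmail.com",     "Phone": "11111"},
    {"OBJECT_ID": 3, "Fname": "john",  "Sname": "brown", "DOB": "02/02/1972", "Email": "ghjkghjk_AT_hotmail.com",     "Phone": "99999"},
    {"OBJECT_ID": 4, "Fname": "john",  "Sname": "brown", "DOB": "02/02/1972", "Email": "ghjkghjk_AT_hotmail.com",     "Phone": None},
    {"OBJECT_ID": 5, "Fname": "john",  "Sname": "brown", "DOB": "20/01/2020", "Email": "qwertyui_AT_gmail.com",       "Phone": "55555"},
    {"OBJECT_ID": 6, "Fname": "john",  "Sname": "brown", "DOB": None,         "Email": "qwertyui_AT_gmail.com",       "Phone": None},
    {"OBJECT_ID": 7, "Fname": "john",  "Sname": "brown", "DOB": "20/01/2020", "Email": None,                          "Phone": "55555"},
    {"OBJECT_ID": 8, "Fname": "john",  "Sname": "brown", "DOB": None,         "Email": None,                          "Phone": None},
    {"OBJECT_ID": 9, "Fname": "Julie", "Sname": "brown", "DOB": "25/09/1950", "Email": "abcdefghijk_AT_truemail.com", "Phone": "88888"},
]

LINK_COLS = ["DOB", "Email", "Phone"]    # the "hard" identifiers; names alone are too weak to link on

parent = {r["OBJECT_ID"]: r["OBJECT_ID"] for r in records}   # union-find over OBJECT_IDs

def find(x):
    while parent[x] != x:
        parent[x] = parent[parent[x]]    # path halving
        x = parent[x]
    return x

def union(a, b):
    parent[find(a)] = find(b)

first_seen = {}                          # (column, value) -> first OBJECT_ID that carried it
for rec in records:
    for col in LINK_COLS:
        if rec[col] is None:
            continue
        key = (col, rec[col])
        if key in first_seen:
            union(rec["OBJECT_ID"], first_seen[key])
        else:
            first_seen[key] = rec["OBJECT_ID"]

groups = {}
for rec in records:
    groups.setdefault(find(rec["OBJECT_ID"]), []).append(rec["OBJECT_ID"])
print(sorted(groups.values()))           # [[1, 2, 3, 4], [5, 6, 7], [8], [9]]
```

This transitive grouping is essentially what I was hoping the Fuzzy Match + Make Group combination would give me.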


So, how would you tackle this? Any solution or approach you could offer which would scale for the number of rows and columns involved would be much appreciated.

 

Thanks

 

P

6 REPLIES
pdave87
11 - Bolide

@PaulSmith thanks for sharing. 100% agree with the data quality and governance comment; this is one of the best use cases for learners like me. From my experience, a few points come to mind below (apologies if these have already been thought about).

 

As a starting point:

 

1. Can I divide the data? i.e. exclude or prioritize the most relevant attributes/columns, or create a super-set of this dataset?

2. Can I build the grouping logic on the most consistent columns, i.e. first and second name? If that approach is acceptable (which means we will lose the inconsistent data in the other attributes, such as the differing email and telephone information), then we can look into transforming the data using the high-priority, consistent columns.

 

I really like this question and I am totally open to the opinions of other experts from the community. It is a fabulous use case for data governance and quality, as you mentioned earlier. We had a team of big data experts break the source data down into several Oracle databases and set up dependencies that lead to the required information, but above all it starts with the logic of building a super-set and then identifying/creating links.

 

Looking forward to learning from everybody here.

 

Best regards,

Pratik

PaulSmith
5 - Atom

Hi Pratik

 

Thanks for looking at the issue. As always with Alteryx, feel free to break down, concatenate or manipulate the data in any way that works for you and achieves the objective 😀

 

I found that developing a solution for the 6 attributes listed is relatively straightforward (you could almost hard-code the logic in a few Multi-Row Formula tools); the real challenge is finding a solution that will scale to some 30 attributes. If you start checking for specific column permutations, it gets out of hand very quickly.
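
To put a rough number on "out of hand", counted in Python:

```python
from math import comb

n_cols = 30
print(comb(n_cols, 2))                                      # 435 column pairs to compare
print(sum(comb(n_cols, k) for k in range(2, n_cols + 1)))   # 1,073,741,793 subsets of 2+ columns
```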

 

I too am looking forward to seeing the creativity and ability of this community!

 

Paul

pdave87
11 - Bolide

@PaulSmith happy to give it a try with the best approach I know.

 

I would like to see a slice of your data and understand whether you have something in mind for creating a 'master key', based on which we can build the super-set and subset logic.

 

Below is just a sample starting point: 1) key identification and 2) DQ checks. I am applying simple logic to segregate consistent vs. inconsistent data, and the next steps can then be taken as suits the process.

 

So here's my suggestion (workflow attached for your test run):

 

Step 1 - Key Identification and Health

 

[Workflow screenshot: pdave87_0-1633434665015.png]

 

Step 2 - DQ Check

[Workflow screenshot: pdave87_2-1633435251582.png]

 

 

PaulSmith
5 - Atom

Hi Pratik

 

An intriguing start – I am interested to understand how you envision the DQ check translating into a single user-grouping function?

 

Sadly, I can't share any real data with you as it is personal information; however, the sample data captures the essence of the issue, the only difference being the scale. I have included a list of the (slightly amended) columns being used, to show the challenge.

 

‘Object_id’ is the primary key (PK) in the main source table and acts as the ‘key identifier’. There is a unique person identifier but, as the users are duplicated, a single person who is in the table 4 times will also have 4 unique person identifiers.

 

## Source columns

1 A_OBJECT_ID (PK) 
2 A_ID_NUMBER
3 A_FIRST_NAME
4 B_FIRST_NAME
5 A_PREFERRED_NAME
6 B_KNOWNAS
7 A_MIDDLE_NAME
8 B_MIDDLE_NAME
9 A_SURNAME
10 B_SURNAME
11 B_KNOWNAS_SURNAME
12 A_BIRTH_CERTIFICATE_SURNAME
13 A_TITLE
14 B_TITLE
15 A_DATE_OF_BIRTH
16 B_DATE_OF_BIRTH
17 A_GENDER
18 B_GENDER
19 B_GENDER_O
20 B_UNIQUE_ID
21 B_USERNAME
22 A_EMAIL
23 B_EMAIL
24 A_PHONE
25 B_PHONE
26 A_PHONE_M
27 B_PHONE_M
28 B_START_DATE
29 PT_PERSON_TYPE
30 PT_DESCRIPTION
31 A_POSTCODE_CANDIDATE
32 CERT_COMPETENCY
33 A_IDENTIFYING_CODE_0
34 A_ID_CARD
35 B_ID_CARD
36 A_PERSONALID
37 A_IDENTIFYING_CODE_1
38 A_IDENTIFYING_CODE_2
39 A_USER_NAME
40 A_IDENTIFYING_CODE_3
41 A_IDENTIFYING_CODE_4

## Derived columns
42 PHONE_S_Coalesce 
43 PHONE_M_S_Coalesce
44 PHONE_Coalesce
45 FIRST_NAME_Coalesce
46 MIDDLE_NAME_Coalesce
47 SURNAME_Coalesce
48 TITLE_Coalesce
49 POSTCODE
50 POSTCODE_VALIDATED
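
For anyone reading along: the *_Coalesce columns are, roughly, "the first populated value across the corresponding A_/B_ source columns". A pandas sketch of that idea - the exact pairings shown here are illustrative, not the real mapping:

```python
import pandas as pd

# Illustrative pairing of A_/B_ source columns behind the *_Coalesce derived columns
COALESCE_PAIRS = {
    "FIRST_NAME_Coalesce":  ["A_FIRST_NAME", "B_FIRST_NAME"],
    "MIDDLE_NAME_Coalesce": ["A_MIDDLE_NAME", "B_MIDDLE_NAME"],
    "SURNAME_Coalesce":     ["A_SURNAME", "B_SURNAME"],
    "TITLE_Coalesce":       ["A_TITLE", "B_TITLE"],
    "PHONE_Coalesce":       ["A_PHONE", "B_PHONE"],
}

def add_coalesced(df: pd.DataFrame) -> pd.DataFrame:
    # For each derived column, take the first non-null value scanning left to right
    for derived, sources in COALESCE_PAIRS.items():
        df[derived] = df[sources].bfill(axis=1).iloc[:, 0]
    return df
```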

pdave87
11 - Bolide

@PaulSmith I understand the data can't be shared because of PII compliance. Not an issue; however, I would have really liked to deep-dive and validate my approach. For ease of understanding:

 

The grouping logic remains the same across all steps, i.e. creating pairs across the table (this effort is required to find the unique rows and then the most consistent and accurate row).

 

The steps will be:
1. Create a pair for the key elements, i.e. the most consistent data - names (first, middle & second) & DOB. (Duplicate data elements can be removed at this point, before moving on to the DQ checks, as birth date is the master identifier for any customer record.)


2. Then group the inconsistent data by the paired keys from step 1 - the differences this surfaces are the DQ check.

 

The logic of the key health and DQ checks is exactly the same, i.e. validating data between the current and previous row (sketched below). Any inconsistency gets flagged - as bad keys in step 1 or as a DQ check in step 2.

 

Validation of incremental data (i.e. additional columns) will always be linked to the master key, i.e. name and birth date, and DQ checks for incremental data will always require a new subset of keys (assuming each new column drills further into the customer information and expands the scope of uniqueness in the available records).
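
To make that current-vs-previous-row logic concrete, here is a rough pandas sketch (the key and check columns are illustrative picks from Paul's column list, not a recommendation):

```python
import pandas as pd

MASTER_KEY = ["FIRST_NAME_Coalesce", "SURNAME_Coalesce", "A_DATE_OF_BIRTH"]   # name + DOB
CHECK_COLS = ["A_EMAIL", "PHONE_Coalesce", "A_POSTCODE_CANDIDATE"]            # incremental attributes

def flag_dq(df: pd.DataFrame) -> pd.DataFrame:
    # Sort so that rows sharing the master key sit next to each other,
    # then compare each row against the previous one (Multi-Row-Formula style).
    df = df.sort_values(MASTER_KEY).reset_index(drop=True)
    same_person = (df[MASTER_KEY] == df[MASTER_KEY].shift()).all(axis=1)
    # (rows with missing key parts never match - they would be scoped out as bad keys)
    for col in CHECK_COLS:
        both_present = df[col].notna() & df[col].shift().notna()
        changed = df[col].ne(df[col].shift())
        df[f"DQ_{col}"] = same_person & both_present & changed   # same key, different value -> flag
    return df
```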

 

Best regards,

Pratik

pdave87
11 - Bolide

@PaulSmith 

 

Just adding to my earlier response. Logic reference is shared below:

 

1. Master key = Name + DOB. (Validation of the master key should flag missing key elements; those rows should be scoped out of further grouping/investigation.)
2. Subset key = Master key + Email + Phone + AA + BB + CC + ...other attributes. (This validation should flag inconsistencies within the subset key paired/grouped with the master key - see the sketch below.)
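
And a small sketch of how the two layers of keys could be composed and validated (pandas; the specific columns here are placeholders):

```python
import pandas as pd

NAME_DOB = ["FIRST_NAME_Coalesce", "SURNAME_Coalesce", "A_DATE_OF_BIRTH"]   # master key parts
EXTRA    = ["A_EMAIL", "PHONE_Coalesce"]                                    # attributes added to the subset key

def build_keys(df: pd.DataFrame) -> pd.DataFrame:
    # 1. Master key = name + DOB; rows missing any part are flagged and scoped out
    df["MASTER_KEY_MISSING"] = df[NAME_DOB].isna().any(axis=1)
    df["MASTER_KEY"] = df[NAME_DOB].fillna("").astype(str).apply("|".join, axis=1)

    # 2. Subset key = master key + email + phone + ...; more than one subset key per
    #    master key means those attributes disagree for that person -> inconsistency flag
    df["SUBSET_KEY"] = df["MASTER_KEY"] + "|" + df[EXTRA].fillna("").astype(str).apply("|".join, axis=1)
    variants = df.groupby("MASTER_KEY")["SUBSET_KEY"].transform("nunique")
    df["SUBSET_INCONSISTENT"] = ~df["MASTER_KEY_MISSING"] & (variants > 1)
    return df
```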

 

Hope this helps.
