Alteryx Designer Desktop Ideas

simonaubert_bd · ‎09-11-2024

Hello,

This is a feature I haven't seen in any data prepation/etl. The core feature is to detect the unique key in a dataframe. More than often, you have to deal with a dataset without knowing what's make a row unique. This can lead to misinterpret the data, cartesian product at join and other funny stuff.

How do I imagine that ?

a specific tool in the Data Investigation category

Entry; one dataframe, ability to select fields or check all, ability to specify a max number of field for combination (empty or 0=no max).
Algo : it tests the count distinct every combination of field versus the count of rows

Result : one row by field combination that works. If no result : "no field combination is unique. check for duplicate or need for aggregation upstream".

ex :

order_id line_id amount customer site

1	1	100	A	U_250
1	2	12	A	U_250
1	3	45	A	U_250
2	1	75	A	U_250
2	2	12	A	U_250
3	1	15	B	U_250
4	1	45	B	U_251

The user will select every field but excluding Amount (he knows that Amount would have no sense in key)

The algo will test the following key
-each separate field
-each combination of two fields
-each combination of three fields
-each combination of four fields

to match the number of row (7)
And gives something like that

choice number of fields field combination

very good	2	order_id,line_id
average	3	order_id,line_id, customer
average	3	order_id,line_id, site
bad	4	order_id,line_id, site, customer
…	…	….

Best regards,

Simon

simonaubert_bd · ‎11-22-2024

The python code for such a function

import pandas as pd
from itertools import combinations

def detect_unique_key(df, fields, max_combination=0):
    
    #Detect unique keys in a DataFrame based on field combinations.

    #Parameters:
        #df (pd.DataFrame): The DataFrame to analyze.
        #fields (list): List of fields to test for uniqueness.
        #max_combination (int): Maximum number of fields to combine (0 = no limit).

    #Returns:
        #pd.DataFrame: A DataFrame with columns ['choice', 'number_of_fields', 'field_combination'].
    
    total_rows = len(df)
    max_combination = max_combination if max_combination > 0 else len(fields)

    results = []
    found_minimum = None

    for r in range(1, max_combination + 1):
        for combo in combinations(fields, r):
            combo = list(combo)
            unique_count = len(df.drop_duplicates(subset=combo))

            if unique_count == total_rows:
                if found_minimum is None:
                    found_minimum = r

                if r == found_minimum:
                    choice = "very good"
                elif r == found_minimum + 1:
                    choice = "average"
                else:
                    choice = "bad"

                results.append({
                    "choice": choice,
                    "number_of_fields": r,
                    "field_combination": combo
                })

    if not results:
        print("No field combination is unique. Check for duplicates or aggregation upstream.")
        return pd.DataFrame(columns=["choice", "number_of_fields", "field_combination"])

    return pd.DataFrame(results)

# Example usage
data = {
    'A': [1, 2, 3, 4],
    'B': [1, 2, 2, 4],
    'C': [5, 6, 7, 8]
}

df = pd.DataFrame(data)
fields = ['A', 'B', 'C']
result = detect_unique_key(df, fields)
print(result)

Alteryx Designer Desktop Ideas

Submitting an Idea?

Unique key detector tool