Advent of Code is back! Unwrap daily challenges to sharpen your Alteryx skills and earn badges along the way! Learn more now.

Alteryx Designer Desktop Ideas

Share your Designer Desktop product ideas - we're listening!
Submitting an Idea?

Be sure to review our Idea Submission Guidelines for more information!

Submission Guidelines

Unique key detector tool

Hello,

This is a feature I haven't seen in any data prepation/etl. The core feature is to detect the unique key in a dataframe. More than often, you have to deal with a dataset without knowing what's make a row unique. This can lead to misinterpret the data, cartesian product at join and other funny stuff.

How do I imagine that ?

a specific tool in the Data Investigation category

Entry; one dataframe, ability to select fields or check all, ability to specify a max number of field for combination (empty or 0=no max).
Algo : it tests the count distinct every combination of field versus the count of rows

Result : one row by field combination that works. If no result : "no field combination is unique. check for duplicate or need for aggregation upstream".

ex :

 

order_id line_id amount customer site

11100AU_250
1212AU_250
1345AU_250
2175AU_250
2212AU_250
3115BU_250
4145BU_251

The user will select every field but excluding Amount (he knows that Amount would have no sense in key)

The algo will test the following key
-each separate field
-each combination of two fields
-each combination of three fields
-each combination of four fields

to match the number of row (7)
And gives something like that

 

choice number of fields field combination

very good2order_id,line_id
average3order_id,line_id, customer
average3order_id,line_id, site
bad4order_id,line_id, site, customer
….

Best regards,

Simon

1 Comment
simonaubert_bd
13 - Pulsar

The python code for such a function

import pandas as pd
from itertools import combinations

def detect_unique_key(df, fields, max_combination=0):
    
    #Detect unique keys in a DataFrame based on field combinations.

    #Parameters:
        #df (pd.DataFrame): The DataFrame to analyze.
        #fields (list): List of fields to test for uniqueness.
        #max_combination (int): Maximum number of fields to combine (0 = no limit).

    #Returns:
        #pd.DataFrame: A DataFrame with columns ['choice', 'number_of_fields', 'field_combination'].
    
    total_rows = len(df)
    max_combination = max_combination if max_combination > 0 else len(fields)

    results = []
    found_minimum = None

    for r in range(1, max_combination + 1):
        for combo in combinations(fields, r):
            combo = list(combo)
            unique_count = len(df.drop_duplicates(subset=combo))

            if unique_count == total_rows:
                if found_minimum is None:
                    found_minimum = r

                if r == found_minimum:
                    choice = "very good"
                elif r == found_minimum + 1:
                    choice = "average"
                else:
                    choice = "bad"

                results.append({
                    "choice": choice,
                    "number_of_fields": r,
                    "field_combination": combo
                })

    if not results:
        print("No field combination is unique. Check for duplicates or aggregation upstream.")
        return pd.DataFrame(columns=["choice", "number_of_fields", "field_combination"])

    return pd.DataFrame(results)

# Example usage
data = {
    'A': [1, 2, 3, 4],
    'B': [1, 2, 2, 4],
    'C': [5, 6, 7, 8]
}

df = pd.DataFrame(data)
fields = ['A', 'B', 'C']
result = detect_unique_key(df, fields)
print(result)