
Active Learning Fuzzy Matching in Alteryx With Python's DEDUPE


Hi everyone,

 

As the end of the year fast approaches, I have finally found time to do a bit of research on advanced techniques for active learning fuzzy matching in Alteryx.

 

After running into challenges applying the Fuzzy Matching tools to non-English languages throughout this year (Czech, German, or Danish special characters, for example), I went searching for a technique that would complement the existing toolset in Alteryx and overcome this shortcoming.

 

A recently posted article here on the AYX Community confirmed that there is a hunger for this from the wider audience. I completely agree with @fmvizcaino on the approach listed in that post: replacing special characters, or handling them with preprocessing techniques, may work nicely. Still, I wanted to research what other options are out there.
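
Just to illustrate that preprocessing route with a minimal sketch (the sample strings are my own made-up examples): Python's unidecode transliterates special characters to their closest ASCII equivalents, and it is the same normalization that the preProcess() function in the code further below applies.

from unidecode import unidecode

# Made-up Czech, German, and Danish samples with special characters
samples = ['Matyáš', 'Řehoř', 'München', 'Søborg']

for s in samples:
    # unidecode maps each accented character to its closest ASCII
    # equivalent, e.g. 'á' -> 'a', 'ř' -> 'r', 'ü' -> 'u', 'ø' -> 'o'
    print(s, '->', unidecode(s))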

 

I thought that an alternative approach might be worth sharing. Not only does DEDUPE seem to work fine with non-English characters out of the box, but because it incorporates active learning (asking the user for help during the training stage), it should be able to handle these challenges gracefully.

 

I originally found out about this package from a post on Medium, but then discovered that Dr. Nick (depicted below) had beaten me to it. So shout out where shout out is due... massive thanks to @NickJ for originally testing the dedupe package and laying the groundwork for this as an interesting extension of the fuzzy matching available in Alteryx.

 

[image: Dr. Nick]

 

What is the dedupe package?

Python's dedupe is a library that uses machine learning to perform de-duplication and entity resolution quickly on structured data.

 

dedupe will help you:

  • remove duplicate entries from a spreadsheet of names and addresses

  • link a list with customer information to another with order history, even without unique customer IDs (see the sketch after this list)

  • take a database of campaign contributions and figure out which ones were made by the same person, even if the names were entered slightly differently for each record

dedupe takes in human training data and comes up with the best rules for your dataset to quickly and automatically find similar records, even with very large databases.
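
For the second bullet, linking two lists, dedupe ships a separate RecordLink class that follows the same train-then-match flow as the deduper used later in this post. A minimal sketch, assuming two toy dicts of records keyed by ID (in practice you would build these with a readData-style helper like the one in the full code below):

import dedupe

# Toy records keyed by ID; real data would come from your two sources
customers = {
    0: {'name': 'jan novak', 'address': '12 main street'},
    1: {'name': 'petr svoboda', 'address': '4 elm road'},
}
orders = {
    0: {'name': 'jan novák', 'address': '12 main st'},
    1: {'name': 'p. svoboda', 'address': '4 elm rd'},
}

# Hypothetical field definition; use the columns shared by both datasets
fields = [
    {'field': 'name', 'type': 'String'},
    {'field': 'address', 'type': 'String'},
]

linker = dedupe.RecordLink(fields)
linker.sample(customers, orders, 15000)  # draw candidate pairs to label
dedupe.consoleLabel(linker)              # same y/n/u/f session as Dedupe
linker.train()

# Returns ((id_1, id_2), score) pairs that dedupe believes match
linked = linker.match(customers, orders, threshold=0.5)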

 

You can find out more about the dedupe package, including examples, at https://github.com/dedupeio/dedupe

 

How Does Dedupe Work?

Effective deduplication relies largely on domain expertise.

 

This is for two main reasons: first, because domain experts develop a set of heuristics that enable them to conceptualize what a canonical version of a record should look like, even if they’ve never seen it in practice.

Second, domain experts instinctively recognize which record subfields are most likely to uniquely identify a record; they just know where to look.

 

As such, Dedupe works by engaging the user in labeling the data via a command line interface, and using machine learning on the resulting training data to predict similar or matching records within unseen data.

 

There is also a great article on Medium that goes deeper into the internals.
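
Stripped of all the file handling, the process described above comes down to a short skeleton: nominate fields, let dedupe sample candidate pairs, label them on the command line, train, then cluster. A condensed sketch of the dedupe 1.x flow that the full listing below expands on (the toy records are mine):

import dedupe

# Toy records keyed by ID
data_d = {
    0: {'name': 'abc learning center', 'address': '1 main st'},
    1: {'name': 'abc learning centre', 'address': '1 main street'},
    2: {'name': 'sunny day care', 'address': '9 oak avenue'},
}

fields = [
    {'field': 'name', 'type': 'String'},
    {'field': 'address', 'type': 'String'},
]

deduper = dedupe.Dedupe(fields)
deduper.sample(data_d, 15000)   # sample candidate pairs for labeling
dedupe.consoleLabel(deduper)    # interactive y/n/u/f labeling session
deduper.train()                 # learn weights and blocking predicates

threshold = deduper.threshold(data_d, recall_weight=1)
clusters = deduper.match(data_d, threshold)  # [(record_ids, scores), ...]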

 

Example

 

Let's take the example from dedupe.io.

 

This demonstrates how to use dedupe with a comma separated values (CSV) file.

All operations are performed in memory, so it will run very quickly on datasets of up to ~10,000 rows.

 

We start with a CSV file containing our messy data. In this example, it is listings of early childhood education centers in Chicago compiled from several different sources.

 

The output will be a CSV with our clustered results.

 

 

Workflow

First though: you can find a workflow with fully annotated code attached to this post.

 

Follow the comments through the code; they should make it clear what to do if you want to retrain, allow active learning, or simply reuse a model you have already created.

 

Note that retraining the model, generating outputs, and clustering the data on this relatively small dataset takes around 10 minutes.

 

[image: the attached workflow with the annotated Python code]

 

This is what the input looks like: we have quite a few duplicates in SITE NAME and ADDRESS.

 

[image: input data with duplicated site names and addresses]

 

And as the output, we have the same list with an indication of clusters and a confidence score for each potential duplicate match.

 

[image: output with cluster IDs and confidence scores]

 

Active Learning

 

dedupe has the ability to prompt the user to engage in active learning by showing pairs of entities and asking if they are the same or different.

Do these records refer to the same thing?
(y)es / (n)o / (u)nsure / (f)inished

Active learning is the so-called special sauce behind Dedupe. As in most supervised machine learning tasks, the challenge is to get labeled data that the model can learn from.

 

The active learning phase in Dedupe is essentially an extended user-labeling session, which can be short if you have a small dataset and can take longer if your dataset is large.
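
As a side note, if you already have labeled pairs from an earlier manual review, dedupe 1.x also accepts them programmatically via markPairs, so the console session only needs to cover what is still uncertain. A hedged sketch with made-up records, assuming deduper is a dedupe.Dedupe instance that has already sampled your data:

# Hypothetical pre-labeled examples; each entry pairs two full records
labeled_pairs = {
    'match': [
        ({'name': 'abc learning center', 'address': '1 main st'},
         {'name': 'abc learning centre', 'address': '1 main street'}),
    ],
    'distinct': [
        ({'name': 'abc learning center', 'address': '1 main st'},
         {'name': 'sunny day care', 'address': '9 oak avenue'}),
    ],
}

deduper.markPairs(labeled_pairs)  # seed training before/instead of consoleLabel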

 

It’s important to note that the blocking, active learning and supervised learning portions of the deduplication process are very dependent on the dataset attributes that the user nominates for selection.

User experience and domain knowledge factor in heavily at multiple phases of the deduplication process.
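
Concretely, "nominating attributes" means the field definition you hand to dedupe: each entry names a column, a comparison type, and whether missing values are expected. A small sketch (the full listing below uses 'String' and 'Exact'; dedupe 1.x also offers further variable types such as 'ShortString', 'Text', and 'Price' for other kinds of columns):

# Each nominated attribute drives blocking and the learned comparisons
fields = [
    {'field': 'Site name', 'type': 'String'},                   # fuzzy string comparison
    {'field': 'Zip', 'type': 'Exact', 'has missing': True},     # exact match, gaps allowed
    {'field': 'Phone', 'type': 'String', 'has missing': True},  # fuzzy, gaps allowed
]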

 

So how does it work in practice?

 

In the attached code, if the settings_file cannot be found at the given path (i.e. it is missing), the model will be retrained and the active learning session will start.

 

Make sure you run the Python Code tool in interactive mode for this step, as otherwise you will get a flood of exceptions because Alteryx cannot run the interactive prompt straight from the workflow itself.

 

The model you build and all of your active learning labels can be saved during this process; check out the parts of the code dealing with the settings file and the training file respectively.

 

Below is a sample of what the active learning interface looks like.

 

[images: the active learning prompt and labeling session in the Python tool]

 

Summary

Using DEDUPE could be an interesting alternative to the out-of-the-box Fuzzy Matching tools.

Primarily for two reasons:

 

First, your SMEs/domain experts can actually be part of the machine learning labelling process (i.e. the active learning bit).

 

Second, I have tested this on several rows of data which include special characters (for instance my name, which is Czech and has those weird special signs above the letters), and it worked nicely.

 

All the workflows, sample data, and Python code can be found in the attached workflows.

 

One last thing, just for reference: the code.

 

from ayx import Alteryx
Alteryx.installPackages("dedupe")
Alteryx.installPackages("unidecode")

import dedupe
import os
import csv
import re
import logging
import optparse
from unidecode import unidecode

# Point to the file you want to dedupe
input_file = '//Mac/Home/Downloads/Active Learning for Data Deduplication/csv_example_messy_input.csv'

# Results of deduping
output_file = '//Mac/Home/Downloads/Active Learning for Data Deduplication/csv_example_output.csv'

# Setting files as result of training your dedupe model
settings_file = '//Mac/Home/Downloads/Active Learning for Data Deduplication/csv_example_learned_settings'

# Training file that will be used to save your model
training_file = '//Mac/Home/Downloads/Active Learning for Data Deduplication/csv_example_training.json'

def preProcess(column):
    """
    Do a little bit of data cleaning with the help of Unidecode and Regex.
    Things like casing, extra spaces, quotes and new lines can be ignored.
    This could also be done using parsing tools in AYX.
    """
    try:  # python 2/3 string differences
        column = column.decode('utf8')
    except AttributeError:
        pass
    column = unidecode(column)
    column = re.sub('  +', ' ', column)
    column = re.sub('\n', ' ', column)
    column = column.strip().strip('"').strip("'").lower().strip()
    # If data is missing, indicate that by setting the value to `None`
    if not column:
        column = None
    return column

 

def readData(filename):
    """
    Read in our data from a CSV file and create a dictionary of records, 
    where the key is a unique record ID and each value is dict
    """

    data_d = {}
    with open(filename) as f:
        reader = csv.DictReader(f)
        for row in reader:
            clean_row = [(k, preProcess(v)) for (k, v) in row.items()]
            row_id = int(row['Id'])
            data_d[row_id] = dict(clean_row)

    return data_d

print('importing data ...')
data_d = readData(input_file)

 

# If a settings file already exists, we'll just load that and skip training
# Note you need a different settings file for every dataset you are deduping
if os.path.exists(settings_file):
    print('reading from', settings_file)
    with open(settings_file, 'rb') as f:
        deduper = dedupe.StaticDedupe(f)
else:
    # ## Training

    # Define the fields dedupe will pay attention to
    # Will differ for each dataset you are processing
    fields = [
        {'field' : 'Site name', 'type': 'String'},
        {'field' : 'Address', 'type': 'String'},
        {'field' : 'Zip', 'type': 'Exact', 'has missing' : True},
        {'field' : 'Phone', 'type': 'String', 'has missing' : True},
        ]

    # Create a new deduper object and pass our data model to it.
    deduper = dedupe.Dedupe(fields)

    # To train dedupe, we feed it a sample of records.
    deduper.sample(data_d, 15000)

    # If we have training data saved from a previous run of dedupe,
    # look for it and load it in.
    # __Note:__ if you want to train from scratch, delete the training_file
    if os.path.exists(training_file):
        print('reading labeled examples from ', training_file)
        with open(training_file, 'rb') as f:
            deduper.readTraining(f)
            
    # ## Active learning
    # Dedupe will find the next pair of records
    # it is least certain about and ask you to label them as duplicates
    # or not.
    # use 'y', 'n' and 'u' keys to flag duplicates
    # press 'f' when you are finished
    print('starting active labeling...')

    dedupe.consoleLabel(deduper)

    # Using the examples we just labeled, train the deduper and learn
    # blocking predicates
    deduper.train()

    # When finished, save our training to disk
    with open(training_file, 'w') as tf:
        deduper.writeTraining(tf)

    # Save our weights and predicates to disk.  If the settings file
    # exists, we will skip all the training and learning next time we run
    # this file.
    with open(settings_file, 'wb') as sf:
        deduper.writeSettings(sf)
        
# Find the threshold that will maximize a weighted average of our
# precision and recall. With recall_weight=1 we weight precision and
# recall equally; a recall_weight of 2 would say we care twice as
# much about recall as we do about precision.
#
# If we had more data, we would not pass in all the blocked data into
# this function but a representative sample.

threshold = deduper.threshold(data_d, recall_weight=1)

# ## Clustering

# `match` will return sets of record IDs that dedupe
# believes are all referring to the same entity.

print('clustering...')
clustered_dupes = deduper.match(data_d, threshold)

print('# duplicate sets', len(clustered_dupes))

# ## Writing Results

# Write our original data back out to a CSV with a new column called 
# 'Cluster ID' which indicates which records refer to each other.

cluster_membership = {}
cluster_id = 0
for (cluster_id, cluster) in enumerate(clustered_dupes):
    id_set, scores = cluster
    cluster_d = [data_d[c] for c in id_set]
    canonical_rep = dedupe.canonicalize(cluster_d)
    for record_id, score in zip(id_set, scores):
        cluster_membership[record_id] = {
            "cluster id" : cluster_id,
            "canonical representation" : canonical_rep,
            "confidence": score
        }

singleton_id = cluster_id + 1

with open(output_file, 'w') as f_output, open(input_file) as f_input:
    writer = csv.writer(f_output)
    reader = csv.reader(f_input)

    heading_row = next(reader)
    heading_row.insert(0, 'confidence_score')
    heading_row.insert(0, 'Cluster ID')
    canonical_keys = canonical_rep.keys()
    for key in canonical_keys:
        heading_row.append('canonical_' + key)

    writer.writerow(heading_row)

    for row in reader:
        row_id = int(row[0])
        if row_id in cluster_membership:
            cluster_id = cluster_membership[row_id]["cluster id"]
            canonical_rep = cluster_membership[row_id]["canonical representation"]
            row.insert(0, cluster_membership[row_id]['confidence'])
            row.insert(0, cluster_id)
            for key in canonical_keys:
                # the csv module in Python 3 expects str, not bytes
                row.append(canonical_rep[key])
        else:
            row.insert(0, None)
            row.insert(0, singleton_id)
            singleton_id += 1
            for key in canonical_keys:
                row.append(None)
        writer.writerow(row)

 

David Matyas
Sales Engineer
Alteryx

@DavidM just wanted to leave a comment and let you know that these write-ups are awesome! Keep up the great work. 


Thanks @BrandonB, appreciate your kind words 😉

David Matyas
Sales Engineer
Alteryx