Start Free Trial

Alteryx Designer Desktop Discussions

Find answers, ask questions, and share expertise about Alteryx Designer Desktop and Intelligence Suite.

Fuzzy Matching Subsets, Common word isolation and Macro's for rerunning fuzzy data mergers

ibesmond
8 - Asteroid

I had a couple of questions that I thought might result in simple but helpful answers. Well... maybe the first two.

 

I'm spent close to a month learning, researching, reading the forums/knowledge base articles, watching videos and getting one-on-ones with Alteryx designers through the my company's account manager and another program they had going in the past.

 

In the end, the results had multiple duplicates, and I didn't feel very successful.  I ran out of time to complete the project and had to go with the results that came out of it workflow.  I wanted to ask these questions to help me think about how I should approach the task, how I should model it, and how should I automate it a little.

 

1.  I have data from close to 70 sources, and I am trying to merge them with a master dataset.  I have the company name and the state.  My question is, do you think my results would be cleaner if I filter the data down by state before I do the fuzzy matching?  I feel that with a smaller subset of data the chance of false positives is reduced because It doesn't have to consider similar names to companies in different states, even though I have my fuzzy match set to make an exact match on state.

 

For instance should I filter my master data to one state, and then fuzzy match it to a sub-source dataset for that state. I have two sub-sources with companies from all states and I may have up to 3 sub-sources per state with company names that I am trying to identify customers versus non customers.

 

2.  Many of the company names in my dataset contain the same words, and adding them as a key helps eliminate the records from matching on these words, but often times these words are the only things that differentiates them.  Does it make sense to try and isolate words that appear in these records and then run separate fuzzy matches?

 

For example "Farm" and "Orchard" might appear in the data close to 80% of the records. As an example records like Hillside Farm or Tennessee Farm and Produce, or records like as Hillside Orchard, Little Hillside Farm, Orchard on the Hillside. I was thinking to isolate Farm and Orchard in both datasets, then fuzzy match them separately. Is this a good strategy?

 

 3.  With so many sources of data is there a way to create a fuzzy match macro (batch or iterative) to repeat the process with multiple files.  For instance right now I have 1 master dataset, which is the company roster.  Next I have two sources that have company names from all 50 states in the US.  Thirdly, I have close to 3 sources per state for most of the 50 states.

 

My original (model) or thoughts were to union all the sources into one huge dataset and then try to fuzzy match using the customer name and state.  Besides that, I don't have anything more granular.  I don't know how to create a macro where it can take the master dataset, and compare it against one source.  Grab the needed data from the second source and merge it with the master.  For example I want to know what they grow.  My master database has the company name, customer ID and the state.  My second source has the company name and what they grow, for example apples.  At the end I want to two lists.  The first a list  the being the company name, customer ID, State and what they grow, in this case apples, and the second list I want is the non-matches: basically the list of company name, state and what they grow.  On top of that, my third source might have the same company name and something else that they grow such as pears.

 

The result would be a list that has the customer name, customer ID, State, (source 1) what they grow and (source 2) what they grow.  Same for the second list, excluding the customer ID. 

 

I know this is a much bigger ask then the suggestions on the first two questions, but I'm having trouble modeling my data, and wanted to get some outside perspective.  I don't think the best approach is to match them one by one manually, or create huge waterflow using hundreds of tools and fuzzy matchings to complete this.  Its basically repeating the process with a new source and appending it to the original output.  

 

The main question is what tools would I need to use if I were to go with a batch/iterative macro approach to have the output from running the workflow the first time become the input the second time the macro run's and so forth? and same thing with the sub-sources.  The first time the macro runs, and kicks out the non-matches as an export, that list becomes the input the next time the macro runs. 

 

In the end I'm trying to result with a master customer dataset that merges all the sources to get all the products they grow, and I have a secondary dataset of all the non-customers that contain all the products they grow. 

 

Any articles, etc are appreciated.  Advise on tools and configuration also also very helpful.  I have seen many videos and articles out there on the internet, but most of the time these examples talk about purging. So I'm missing the second half of merging the data after the matching and grouping.  I know this is not an easy answer, but anything that will put me in the right direction would be fabulous.  

 

Thumbs up to all of you that have helps in the past!  

 

 

 

 

1 REPLY 1
ArtApa
Alteryx
Alteryx

Hi @ibesmond - I will try to concisely answer your questions below:

 

1) Totally! Everything that can be matched precisely must be matched before Fuzzy matching. You do not want to leave it to an algorithm.

 

2) Fuzzy matching is like an art and not an exact science. You have to generate a model based on your data and use the one that gives you the most confidence. The Beauty of Alteryx is that you can quickly try different approaches and decide, which one works the best for you.

 

3) I understand that everything is clear with the Master Data source. The challenge is with the other data sources as they probably have different schemas. Without knowing more and given the information provided, I can only suggest the following:

 

  a) You can build a Fuzzy Matching workflow as per the following training. It will give you both matched and unmatched records as you wanted: https://community.alteryx.com/t5/Videos/Fuzzy-Matching-for-Beginners/td-p/330575 

  b) You will connect to the Master Data source directly.

  c) You can “union” all other data sources using the following Macro: https://community.alteryx.com/t5/Alteryx-Designer-Knowledge-Base/CS-Macro-Dev-Reading-in-Multiple-Fi... 

 

Hope that helps.

 

Cheers,

Labels
Top Solution Authors