I've been working with a variety of person specific datasets which I'm gradually growing and I'm wondering if anyone has figured out a best practice way to cleanse and blend when the identifying fields are inconsistently populated (I'm not talking about fuzzy matching just inconsistencies).
Also give me a shout if this is something you are interested in having or are interested in helping me work through.
For example:
- Initials vs whole names
- Full name vs names in parts
- Maiden/former names
- No middle name vs unknown middle name
- Middle name is in last name field
- Case sensitive or not
- Prefixes/suffixes
- Special characters
- Indirect identifiers (e.g. you know that they can't have worked before their birthday)
For my specific use case I will be using the data for machine learning model and I think there will be enough data to allow some mismatches but I guess what I'm looking for is a way to make it is easy adjust as I learn what the data means.
Below is what I've thought up so far but it feels like something someone might have already thought off.
So I'm thinking 3 steps:
1 - Standardize the data
2 - Identify uniqueness
3 - Perform join on acceptable level of uniqueness
1 - Standardize the data
Change the fields to:
- Last Name
- Last Name type - Full or Initial
- First Name
- First Name type - Full or Initial
- Middle Name
- Middle Name type - Full or Initial or unknown (i.e. Full or initial but with blank middle name would means the person doesn't have a middle initial which might know based on other information or the observation of other data in the same dataset)
2 - Identify uniqueness
If a dataset has a last name which is unique then it doesn't need any other identifiers to differentiate it from others. When you perform the join it might be helpful to use this as a way to identify where mismatches might exist (in fact perhaps this could be used in the Machine Learning model to give ambiguous matches less weight).
3 - Perform join
Wishfully hoping that someone might have developed a macro to circle through combination of columns that might used to join and provide:
- The best match (and this might be limited by some input parameter)
- Some scoring, comment information which could highlight where there might be other plausible matches which might help with error identification
@EnglishmaninNY I have had mixed success with the Fuzzy Match tool to return names that are matching each other.
You can adjust the Configuration settings to potentially purge data that does not meet a certain match threshold. If you set that Match Threshold field really high (95-98 percent), you'll be able to get a composite list of unique names to support Steps 2 and 3 listed above prior to finishing out your initial data prep.
But, I'd ask how much of the name deviation is known already (e.g. Zack Dodson vs Zack Dodsun, etc.)? I ask because that'll also influence whether something like a Regex_Match function in a formula tool, Regex, or some other method will give you the specific results you're looking for prior to the heavy lifting in your ML model.
Thanks for prompt response.
As I said fuzzy matching doesn't really solve the problem because I'm not seeing typos or misspelling but rather inconsistent fields.
I have 100+ datasets and I want make the data scrubbing as automated as possible.
Assuming they have different file paths, you can:
1) Have a Directory tool bring in the file paths,
2) Have a macro that brings in the data sets individually, then Transpose the data set before outputting through your Macro output tool. Since now your columns are now vertically oriented in Name/Value columns, you can develop Regex logic to isolate the pattern and create a new categorical field that is aptly named against that pattern before,
3) Cross tabbing the data on your new categorical label (Headers) with the data coming from the Values column.
That should get you to the point where at least your data fields are standardized across the 100+ files.
I would only utilize the Fuzzy Match to get the match score between Field 1 and Field N (answering the requirement in 3.2 from above).
Best of luck! Definitely super curious if @binuacs or @caltang ever solved something like this - tagging them because I see them post heavily and always have great solutions.
Thanks for the tag @rzdodson !
I think what would be helpful is if we can get some sample data at least @EnglishmaninNY.
Fuzzy Matching isn’t the answer here as you mentioned, this requires some heavy tweaking to match word for word and get a % figure of match based on position as well.
As requested here is some sample data:
- Sample from government employee salary database (this is actually combination of data from 2 state Oklahoma and Kansas and if sample so that we have some interesting examples)
- Sample from a physician listing (Again this is a sample but it is limited to physicians appear to practice in Oklahoma and Kansas)
I've also included:
- The simplest matching based on last name and first letter initial for first name and middle names
- Snapshot of the results where I've highlight matches where evidently last name and initials is insufficient to identify an appropriate match:
- Yellow because its coming up with multiple plausible matches
- Light blue because First names are different
We probably need a bigger data set to experiment on the uniquiness calculation but as an example C Davis I think you can assume is a more common combination than K S Biard and so we can likely assume that later is a match but the former might not be.