This site uses different types of cookies, including analytics and functional cookies (its own and from other sites). To change your cookie settings or find out more, click here. If you continue browsing our website, you accept these cookies.
I have an interesting problem to try and solve. I am doing fuzzy matching and the result of the match set shows like the "Source" and "Match Candidate" down below. I find that the same combinations are found, if i was to try to feed this into a system to make changes it would error out for the reasons shown below.
For the example:
1. once a match candidate is matched to a source, it stops being a record.
2. These records would not be grouped together in the data set, but are ordered so that the first Source name (Robert) would be above the other Source names (Bob, Bobby, Rob) to ensure that the lower level aliases get matched to the appropriate Source. This logic would be applied to all names in the set.
3. This example shows 1 Set of records within the entire database. for example, there would be other names like "Fred" and "Fredrick" that would not match to Robert.
Put a unique tool on the match set to remove same combinations. Then, sort descending on the score and use a summarise tool to group by the ID and the highest score to take the first score. Then join back to the original data to obtain the remainder of the dataset.
Are you saying after my sort to scan the "Match Candidate" Side and remove duplicates? I think this would work to eliminate 60-70% of the duplicates, the only issue remaining then still is to remove the "Robert" from the Match Candidate column...
In looking at the hypothetical result I think for the 2nd part I will just do a filter on if the Source is found in the Match Candidate then false, which I believe is the same idea as you are trying to illustrate more or less?
Remove the duplicates using Unique tool - it will not remove 60-70%, it will remove 100%. If you do not want Robert as a match candidate filter it out before the fuzzy match or up the threshold percentage so it does not match.
In my use case, "Robert" has the ability to be on both sides of the match. This is due to the way I am matching and is very intentionally designed. The generic fuzzy match score mechanic is not being used as I created a customized scoring algorithm to have a finer control for my project than is offered in the tool.
That being said, there is a series of checks and functions that happen after the fuzzy match to align which record is on the "Source" side and which is on the "Match Candidate" side and sort them specifically. The actual data is a massive fluctuating data set with many sub sets and custom rules.