Hello everyone,
Today I have a slightly complex (for me!) task to accomplish: while using 2 public datasets which contain information about UK properties, I need to join them together based on their "Address" field. The problem is that these addresses are slightly different, for example:
I need to clean the data somehow first and then join the tables, or do an "approximate" join. I tried the Fuzzy Match tool but unfortunately I didn't get any results. Some ideas?
The sample data is attached below. I also included a Postcode column, but there are a lot of nulls so I'm not sure if that can be used properly.
Thank you for your help.
Hi @Amalainic ,
If you have two datasets that don't share a common key , then you are basically looking at Fuzzy Matching, there is no other way.
The first thing to do is join them both on the postcode field, which you might need to standardise first (remove spaces, make uppercase etc.)
Then, take the output of the left and right, representing the unmatched records, and they represent source 1 and 2 in the workflow I've attached.
Fuzzy matching is not an exact science, hence fuzzy, but in the example I've attached using your two examples, you can see the matches are applied with a high degree of likelihood.
I hope this helps.
M.
Thank you for your help. It's weird that when I use my data (even though I sampled just 5-10% of it which is 80-100k rows) the fuzzy match tool gets stuck at 51-52%. Any ideas why? I waited a lot for it to compile but no luck.
You can use my sample data that I posted, the 2 excels, if you want to try.