Alteryx Designer Desktop Discussions

mshinn9999

Hello all. I am trying to figure out the best approach for this. I have File 1 (with a single field Name) with say 300K records. Then, I have another file which sources from another application with the same Name field (but has only about 1K records). I am trying to understand how likely there would be close matches in the Name field between the 2 separate files.

Is the Fuzzy Match tool (MERGE) the best approach here? Or, is there something a bit easier tool to leverage here? I did not want to over-complicate this.

Mike

jdminton

I would say possibly. The fuzzy match tool can be very helpful. but you will often get false positives even if everything is set up correctly. Why are you considering fuzzy match to straight join? I'm guessing there are probably formatting differences or possibly variation in user input, etc.

mshinn9999

Hello there. To add more color here, I will need to create a calculation based on the 2 different files (data sources). TBH, I would like to avoid the Fuzzy Match all together. I can use the combination of the Data Cleaning/Formula tools to have the data format be quite similar. I guess I'm trying to see if there is something similar to say a Vlookup to get a sense of what percentage of likely matches there are between the 2 files. I hope this makes sense. It's a data profiling exercise.

OllieClarke

@mshinn9999

If you're comparing a single string field between 2 datasets, then the Find and Replace tool can be useful here. It has more flexibility than a join tool (can find partial matches, or case insensitive ones), and if you use it in append mode, then a browse tool after showing the % of non-nulls in the appended column will tell you what % found a match.

jdminton

1000 records is getting a bit large for find/replace in my opinion. I would recommend using Uppercase() to remove any case issues with name. It sounds like the names should be pretty similar. The join tool will provide any fallout and you can use a fuzzy match on the fallout from both sides to identify possible issues if desired.

mshinn9999

Thank you both for the replies. Attached is sample workbook. How would this look like per implementation? Data of course is all dummied up.

Mike

OllieClarke

@jdminton on my machine with 300k/1000 records the Find and replace is still significantly faster than the formula+join combo, plus it has partial match capabilities.

There will be a tipping point where that might no longer be true though

OllieClarke

Hi @mshinn9999

could you possibly share the excel your workflow references too please?

mshinn9999

Yes of course and thank you for the assistance here. Here are all of the relevant files.

OllieClarke

@mshinn9999 here's how the Find and Replace tool could work (either at the end of your data prep, or even earlier on)

Alteryx Designer Desktop Discussions

Comparing 2 different files (same field)