Dearest Alteryx Community... can you help me please?
I have a table of data in which certain columns contain a variety of misspellings that I need to consolidate into a single agreed spelling for all matches.
For example:
Business School
School of business
Busness
Business Schoool
Business Faculty
Business department
Business Enterprise
Business clients
There are many matches that need simplifying to just 'Business School'. To make matters harder, there are some departments like Business Enterprise and Business clients that are not part of the business school and so can be left alone.
Any idea how to go about this? I am thinking something along the lines of fuzzy logic, but as a complete n00b to Alteryx, I am unsure of the best way to proceed.
I also need to repeat this with several other departments that have been 'interestingly input' into the data and need cleansing.
TIA
Hi @Joel_Mills ,
I suppose your case is related to Named Entity Recognition.
Alteryx has a Named Entity Recognition tool as part of the Alteryx Intelligence Suite.
Unfortunately, I do not have the license, so I cannot support you further.
You may want to check this page
https://help.alteryx.com/20223/designer/named-entity-recognition
and consider purchasing the add-on license if you think it fits your case.
Good luck.
Hi @Joel_Mills,
I have done a similar project before, and what we used was a combination of fuzzy matching and a lookup dictionary.
The hard part, as you suggested, is that some names, such as "Business Enterprise", look very similar to "Business School" but mean something different, and fuzzy matching alone cannot tell them apart.
So my suggestion is to filter out the cases that you know don't belong to your group (in this case, Business School), and then fuzzy match on the remaining values.
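To make the filter-then-fuzzy-match idea concrete, here is a minimal sketch in plain Python rather than an Alteryx workflow. It is not how the Alteryx Fuzzy Match tool scores strings; it uses the standard library's difflib as a stand-in. The exclusion list comes from the question, while the 0.6 threshold and the function names are assumptions made up for this illustration and would need tuning against real data.

```python
from difflib import SequenceMatcher

CANONICAL = "Business School"
# Known look-alikes that must be left alone (taken from the question).
EXCLUDE = {"business enterprise", "business clients"}
# Assumed cut-off for this sketch; tune it against your own data.
THRESHOLD = 0.6


def normalise(text):
    """Lower-case and sort the words so that word order does not matter."""
    return " ".join(sorted(text.lower().split()))


def similarity(a, b):
    """Return a 0..1 similarity score between two normalised strings."""
    return SequenceMatcher(None, normalise(a), normalise(b)).ratio()


def clean(value):
    """Map a value to the canonical spelling unless it is excluded."""
    # Step 1: filter out values known not to belong to the group.
    if " ".join(value.lower().split()) in EXCLUDE:
        return value
    # Step 2: fuzzy match whatever remains against the agreed spelling.
    if similarity(value, CANONICAL) >= THRESHOLD:
        return CANONICAL
    return value


departments = [
    "Business School",
    "School of business",
    "Busness",
    "Business Schoool",
    "Business Enterprise",
    "Business clients",
]
cleaned = [clean(d) for d in departments]
```

Values that differ in wording rather than spelling, such as "Business department", may score close to or below whatever cut-off you pick, which is where the lookup dictionary earns its keep.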
I fixed this in my own way by using find and replace based on a list of cleansed department data that I created and manually edited. The original department values were in one column, and I manually matched each similar value to a standard, agreed new value in a 'clean' column. It was then easy to use find and replace with that file once I brought it into Alteryx.
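For anyone who wants to see the lookup-file approach outside Alteryx, a rough Python sketch follows. The column names 'original' and 'clean', the inline CSV text and the function names are all hypothetical stand-ins for the manually edited file described above, not the actual file used.

```python
import csv
import io

# Hypothetical contents of the manually edited lookup file.
LOOKUP_CSV = """original,clean
Business School,Business School
School of business,Business School
Busness,Business School
Business Schoool,Business School
Business Faculty,Business School
Business department,Business School
"""


def load_lookup(text):
    """Build a case-insensitive mapping from original value to clean value."""
    reader = csv.DictReader(io.StringIO(text))
    return {row["original"].strip().lower(): row["clean"].strip() for row in reader}


def standardise(value, lookup):
    """Return the agreed spelling, or the value unchanged if it is not listed."""
    return lookup.get(value.strip().lower(), value)


lookup = load_lookup(LOOKUP_CSV)
result = [standardise(v, lookup) for v in ["Busness", "Business clients"]]
```

Because anything missing from the lookup is passed through unchanged, departments such as Business Enterprise and Business clients are left alone simply by not listing them.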