Is there a way to keep only the longest values of a unique sequence? For example, if we have the following list:
Test Test123 Test12345 Test67 Test689 Example
We would want to be left with only Test12345, Test689, and Example. The others, which are substrings of a longer value, would be filtered out.
With a large dataset, is there an automated way to check each value against all the other values in the column, see whether it is a substring of any of them, and decide whether it should be kept or removed? I have been leaning towards using formulas, filters, and fuzzy match, but haven't figured out exactly how to do what I want.
Hi @annhood, like you said, fuzzy matching will probably solve your problem. I'm not very good at it, so I came up with this solution instead. It does involve a Cartesian join, so with a large dataset it might not be the most performant option.
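For anyone who wants to see the logic outside of a workflow, here is a minimal Python sketch of the same idea. It is an illustration only, not the workflow from this reply: the function name `keep_longest` is made up for the example, and it assumes "substring" means a case-sensitive match anywhere inside another value. It pairs every value with every other value (the Cartesian join) and drops any value that appears inside a different one.

```python
from itertools import product


def keep_longest(values):
    """Return the values that are not contained in any other value.

    Hypothetical helper for illustration; matching is case-sensitive.
    """
    # De-duplicate while keeping the original order.
    unique = list(dict.fromkeys(values))

    # Cartesian join: compare every value with every other value and
    # collect those that occur inside a different value.
    contained = {
        candidate
        for candidate, other in product(unique, repeat=2)
        if candidate != other and candidate in other
    }

    return [value for value in unique if value not in contained]


values = ["Test", "Test123", "Test12345", "Test67", "Test689", "Example"]
print(keep_longest(values))
```

Note that under this strict substring test, Test67 is also kept, because it does not occur inside Test689. If values like that should be dropped too, the rule would need to be something looser than a plain substring check, such as a shared-prefix comparison. The pairwise comparison grows with the square of the number of unique values, which is the same performance concern as the Cartesian join.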