I have a very large data set (more than 3 lakh rows) in which two columns are unstructured, meaning the same component is mentioned in various formats. The aim is to filter the data with respect to individual components. A sample is attached for reference.
@Sriram369 What would the desired output look like?
Hi,
I would start with the Summarize tool (group by Remarks, count Item Name), then a Sort tool on Count, descending.
This will let you look at the most common Remarks and maybe come up with some Filter tools, for example Contains([Remarks],"STORE"), to create subsets of the original data and then group them by specific category.
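For anyone who wants to try the same count, sort, filter idea outside Alteryx, here is a rough sketch in plain Python. It is only an illustration: the sample rows are made up, the column names are taken from the tool settings above, and a case-insensitive substring test stands in for the Contains filter.

```python
from collections import Counter

# Made-up sample rows; the real data would come from the attached file.
rows = [
    {"Item Name": "Bolt M8", "Remarks": "STORE ISSUE"},
    {"Item Name": "Bolt M8", "Remarks": "store issue"},
    {"Item Name": "Nut M8", "Remarks": "STORE ISSUE"},
    {"Item Name": "Washer", "Remarks": "Returned to vendor"},
]

# Summarize step: count rows per distinct Remarks value.
remark_counts = Counter(row["Remarks"] for row in rows)

# Sort step: most common Remarks first.
ranked = remark_counts.most_common()

# Filter step: keep rows whose Remarks mention "store", ignoring case.
store_rows = [row for row in rows if "store" in row["Remarks"].lower()]

for remark, count in ranked:
    print(count, remark)
print(len(store_rows), "rows mention STORE")
```

Note that the exact-match count treats "STORE ISSUE" and "store issue" as different values, which is the kind of inconsistency the summary is meant to surface.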
It depends on what you need and how much detail you want.
Karolina
Hi @Sriram369
It feels to me that you want to solve a classification problem. Based on my personal experience, you need to get the “key word” list from a domain expert to narrow the field down into the key categories, and then iteratively whittle down the “residual” unmatched records.
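As a rough sketch of that loop in plain Python (not an Alteryx workflow): the keyword list, the category names, and the sample remarks below are all hypothetical, and matching is a simple case-insensitive substring test.

```python
# Hypothetical keyword list; in practice it would come from a domain expert.
KEYWORDS = {
    "Store": ["store", "warehouse"],
    "Vendor": ["vendor", "supplier"],
}

def classify(remark):
    """Return the first category with a keyword found in the remark, else None."""
    text = remark.lower()
    for category, words in KEYWORDS.items():
        if any(word in text for word in words):
            return category
    return None

# Made-up remarks for illustration only.
remarks = ["STORE ISSUE", "Sent to supplier", "Moved to warehouse", "Scrapped"]

labelled = {remark: classify(remark) for remark in remarks}

# The residual is whatever the current keyword list fails to match.
residual = [remark for remark, category in labelled.items() if category is None]
print(residual)
```

Each pass, you would review the residual, add the missing key words or categories, and rerun until what is left is small enough to handle by hand.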
If you have access to the Word Cloud tool, that may be one way to get the initial key word list.
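If the Word Cloud tool is not available, a plain word-frequency count gives similar information. A minimal Python sketch, assuming made-up remarks and a simple split into alphabetic tokens:

```python
import re
from collections import Counter

# Made-up remarks for illustration only.
remarks = [
    "STORE ISSUE",
    "store issue - urgent",
    "Returned to vendor",
    "Vendor return",
]

# Split each remark into lowercase alphabetic tokens and count them.
tokens = [
    token
    for remark in remarks
    for token in re.findall(r"[a-z]+", remark.lower())
]
word_counts = Counter(tokens)

# The most frequent words are candidates for the initial key word list.
print(word_counts.most_common(3))
```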
dawn