Hello everyone,
I am quite new to this tool and I have my first task. I have to figure out how to identify patterns in a column that contains text. My goal is to see whether there is a pattern in the language used: keywords like provide, allocate, update, verify, etc. For this I used the "Zero-shot Text Classification" tool. It worked, but I was wondering how you would build the workflow.
In the attached file, column D is the most important for this task. It contains the actual questions our customers sent us. I want to identify the keywords used in those questions and group them into buckets (for example Dividends, Capital increases, Splits, add information, remove data, explanation needed, etc.).
Finally, I want to relate those buckets to other columns, such as the group the incident was assigned to, the time it took to process the query, the product involved, etc., and identify patterns. For example: Group A takes the longest to answer questions about updates on product AB; Group C is the fastest to resolve incidents about dividends or capital increases, but only on product BC; most requests ask us to review data on dividends for product CD; etc.
Can I use other tools from Machine Learning or Text Mining to do this in an easier way? Attached is a list of dummy requests from our customers and a screenshot of the workflow I used.
Thanks a lot for any ideas!
Cheers
Carolina
don't give up😁
Hi Carolina,
What you are trying to do is called natural language processing. It is the same kind of technique that powers modern AI chatbots like ChatGPT, but a bit less complex.
The best approach for your application is bag-of-words, which treats each piece of text (each entry in the Notes column, in your case) as an unordered collection of words. You lose much of the semantic meaning behind the notes, but it is a practical solution here.
I have attached screenshots of the workflow to process this text data into usable feature columns for reporting or predictive analytics.
At a high level, this workflow:
1. Assigns a row ID to each row
2. Isolates the Notes column
3. Makes every letter lowercase and removes punctuation marks (you can keep punctuation if you think it holds some predictive power, but most of the time text data is processed without it)
4. Splits the blocks of text into their individual words using Regex
5. Uses these words as columns and assigns a 1 if the word is present in the notes and a 0 if not
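For anyone who wants to check the logic outside Alteryx, the five steps above can be sketched in plain Python. This is only a minimal illustration, not the workflow itself: the sample notes and the names `tokenize`, `tokens_by_row` and `bag_of_words` are made up for the example.

```python
import re

# Hypothetical sample notes standing in for the Notes column.
notes = [
    "Please update the dividend for product AB.",
    "Verify the split ratio, please!",
    "Update data on capital increase",
]

def tokenize(text):
    """Lowercase the text, replace punctuation with spaces, split into words."""
    cleaned = re.sub(r"[^a-z0-9\s]", " ", text.lower())
    return cleaned.split()

# Steps 1-4: assign a row ID (starting at 1) and tokenize each note.
tokens_by_row = {
    row_id: set(tokenize(note)) for row_id, note in enumerate(notes, start=1)
}

# Step 5: one column per distinct word, 1 if the word is in the note, else 0.
vocabulary = sorted(set().union(*tokens_by_row.values()))
bag_of_words = {
    row_id: {word: int(word in words) for word in vocabulary}
    for row_id, words in tokens_by_row.items()
}

for row_id, flags in bag_of_words.items():
    present = [word for word, flag in flags.items() if flag]
    print(row_id, present)
```

Each row ends up with the same set of word columns, which is what makes the result usable as filters or as model features.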
This orientation of the data allows you to use individual words as filters for reports and as categorical (true or false) variables for regression and classification.
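As a rough sketch of how a word flag can then be related to the other columns Carolina mentioned, the snippet below averages resolution time per assigned group for notes containing a given word. The records, the column meanings and the function name `mean_hours_by_group` are all hypothetical; in Alteryx the same thing would be done with the usual filter and summarize steps.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical records: (assigned group, hours to resolve, note text).
records = [
    ("Group A", 10.0, "please update dividend for product AB"),
    ("Group A", 6.0, "verify the split ratio"),
    ("Group C", 2.0, "update dividend data"),
    ("Group C", 4.0, "review dividend payment"),
]

def mean_hours_by_group(records, word):
    """Average resolution time per group, using only notes that contain `word`."""
    hours = defaultdict(list)
    for group, hrs, note in records:
        if word in note.lower().split():
            hours[group].append(hrs)
    return {group: mean(values) for group, values in hours.items()}

dividend_times = mean_hours_by_group(records, "dividend")
print(dividend_times)
```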
The screenshots I have provided show the workflow and the configuration of specific modules that are tuned for this task. I have also provided a screenshot of an example report.
If you just want columns for specific words or phrases, you can use the following regex, or ask ChatGPT to write one for you:
\b(ipo|dividend|dividends|capital increase|capital increases|split|splits|add information|remove data|explanation needed)\b
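To see what that pattern picks up, here is a small Python sketch that applies it to a note. The case-insensitive flag and the helper name `find_buckets` are my own additions for the example; the sample sentence is made up.

```python
import re

# Same pattern as above, compiled case-insensitively (an assumption, so that
# "Dividends" and "dividends" are treated alike).
BUCKET_PATTERN = re.compile(
    r"\b(ipo|dividend|dividends|capital increase|capital increases|split|splits"
    r"|add information|remove data|explanation needed)\b",
    re.IGNORECASE,
)

def find_buckets(note):
    """Return the distinct keywords or phrases found in a note, lowercased."""
    return sorted({match.lower() for match in BUCKET_PATTERN.findall(note)})

print(find_buckets("Please review the Dividends and the capital increase for product CD"))
```

Because of the `\b` word boundaries, the pattern only matches whole words, so "splits" is matched but "splitting" is not.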
Side Note:
If you want to retain the semantic meaning of word combinations and their order, for things like "increase capital", you would most likely need a more complex numerical encoding of the phrases. These are often called word vectors (or embeddings), and they look something like this: [1.221, 12.3981, 3.127, .... etc]
Alteryx might have this feature in its ML suite, but I am not sure because I do not have a subscription for it.
Good luck!
Devin

