Text Classification Problem - Not sure How to Start

Dubya_dup_93
5 - Atom
Hi Everyone, I am very familiar with Alteryx but less so with the predictive tools, and I am struggling to design an ML process that will look at each part of a product's written description and classify whether that part is relevant to the product or is not about the product itself. I want to mark the irrelevant parts as anti/non-relevant, to guide and improve search indexing and stop false positive matching.
 
E.g.
Product is: 18-karat White Gold, Sapphire And Diamond Earrings
Category is: Jewelry and Watches
Description Line 1: these earrings are cast from 18-karat white gold and set with clusters of sapphires and shimmering diamonds. (Relevant = 1)
Description Line 2: wear them with tonal dresses or jeans and a t-shirt. (Relevant = 0)
 
I have a data set of about 20k products where one or more sentences have been identified as not relevant, using simple but effective rule sets (i.e. if a sentence contains 'wear it with' then 0). This has been helpful but needs to evolve into something more fluid and ML-based, to cope with the increased variance in how irrelevant text can be composed and to use the product's context to guide it.
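For illustration, the kind of rule-based labelling described above could be sketched in Python roughly as follows. This is a minimal sketch, not the actual rule set: the function name `label_sentence` and the one-item phrase list are hypothetical, and only the 'wear it with' phrase comes from the description above.

```python
import re

# Hypothetical rule list; only "wear it with" is taken from the post.
IRRELEVANT_PHRASES = ["wear it with"]

def label_sentence(sentence, phrases=IRRELEVANT_PHRASES):
    """Return 0 (not relevant) if any rule phrase occurs, else 1 (relevant)."""
    # Lowercase and collapse whitespace so matching ignores case and spacing.
    text = re.sub(r"\s+", " ", sentence.lower())
    return 0 if any(phrase in text for phrase in phrases) else 1

# The first sentence is caught by the rule; the second is not.
print(label_sentence("Wear it with a plain white shirt."))
print(label_sentence("wear them with tonal dresses or jeans and a t-shirt."))
```

The second sentence slips through because it says "wear them with" rather than "wear it with", which is exactly the sort of variance in wording that fixed phrases struggle with.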
 
I was considering using this rule-derived data set to train an ML algorithm, in the hope that it can learn the subtleties and nuances for itself, but I don't want it to simply re-learn the rule set. I need it to be able to draw more subtle inferences about the relevance (or lack of it) of sentences in the dataset from the composition of the product title, category, sentence order etc., and from how it has seen those interact in other examples.
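To make the idea of using product context concrete, one possible way to lay out the training data is one row per sentence, carrying the title, category, sentence position and a simple title-overlap score alongside the rule-derived label. The Python below is only a sketch under those assumptions: `SentenceRow`, `build_rows`, the `title_overlap` feature and the tiny stopword list are all hypothetical and not part of any Alteryx tool.

```python
import re
from dataclasses import dataclass

# Hypothetical stopword list, just enough for the example.
STOPWORDS = {"a", "an", "and", "or", "the", "of", "with"}

def tokens(text):
    """Lowercase word tokens; hyphenated terms such as '18-karat' stay whole."""
    found = re.findall(r"[a-z0-9]+(?:-[a-z0-9]+)*", text.lower())
    return {t for t in found if t not in STOPWORDS}

@dataclass
class SentenceRow:
    title: str
    category: str
    sentence: str
    position: int         # 0-based order of the sentence in the description
    title_overlap: float  # share of title tokens that also appear in the sentence
    label: int            # 1 = relevant, 0 = not relevant (from the rule set)

def build_rows(title, category, sentences, labels):
    """Turn one product into one row per description sentence."""
    title_toks = tokens(title)
    rows = []
    for i, (sentence, label) in enumerate(zip(sentences, labels)):
        shared = title_toks & tokens(sentence)
        overlap = len(shared) / len(title_toks) if title_toks else 0.0
        rows.append(SentenceRow(title, category, sentence, i, overlap, label))
    return rows

rows = build_rows(
    "18-karat White Gold, Sapphire And Diamond Earrings",
    "Jewelry and Watches",
    [
        "these earrings are cast from 18-karat white gold and set with clusters of sapphires and shimmering diamonds.",
        "wear them with tonal dresses or jeans and a t-shirt.",
    ],
    [1, 0],
)
for row in rows:
    print(row.position, round(row.title_overlap, 2), row.label)
```

Context features like position and title overlap are not in the original rules, so a model trained on rows like these has something other than the rule phrases to learn from. Exact token matching is crude (it misses "sapphire" vs "sapphires"), so stemming or lemmatising first would probably help.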
 
Here's an example of the dataset I was going to train with, but I'm unsure how best to structure it and apply it to a particular algorithm. It feels similar to a sentiment analysis piece.
 
 
Dubya_2-1619182801948.png

 

Can anyone give me some pointers on the best algorithm choices / tools, and on structuring the data / configuring the algorithm to steer it?

 

Many thanks in advance as always!

 

w

2 REPLIES
TrevorS
Alteryx Alumni (Retired)

Hello @Dubya_dup_93 
Thanks for posting to the Community!
Are you able to share your workflow with some sample data so the Community can see what you have attempted so far?

This will allow the Community to better troubleshoot directly where you are having trouble.
thanks!
TrevorS

Community Moderator
cgoodman3
14 - Magnetar

Are you looking to do this natively in Alteryx, or do you have experience with languages such as Python?

 

I'm going to caveat upfront that I am not a data scientist, but I have done a little exploratory project work on text analytics. It looks like you need something along the lines of topic modelling (looking at how words are used together to draw out relationships) and named entity extraction (identifying the context of a word, for example knowing that if you talk about Apple in the context of iPhones and tech then Apple is a company and not a fruit).

 

There is a brute force method: create a library of words, count the matches against each description, and then set a threshold that gets you to a relevant / not-relevant accuracy you are comfortable with.
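As a rough illustration of that brute force idea, here is a minimal Python sketch. The word library, the function names and the `min_matches` threshold are all made up for the example; a real library would need to be curated per category and the threshold tuned against labelled data.

```python
import re

# Hypothetical word library per category, hand-picked for this example only.
CATEGORY_TERMS = {
    "Jewelry and Watches": {
        "gold", "karat", "sapphire", "sapphires",
        "diamond", "diamonds", "earrings",
    },
}

def match_count(sentence, terms):
    """Count the words in the sentence that appear in the word library."""
    words = re.findall(r"[a-z0-9]+", sentence.lower())
    return sum(1 for word in words if word in terms)

def is_relevant(sentence, category, min_matches=2):
    """Return 1 if the sentence hits the library at least min_matches times, else 0."""
    terms = CATEGORY_TERMS.get(category, set())
    return 1 if match_count(sentence, terms) >= min_matches else 0

line_1 = "these earrings are cast from 18-karat white gold and set with clusters of sapphires and shimmering diamonds."
line_2 = "wear them with tonal dresses or jeans and a t-shirt."
print(is_relevant(line_1, "Jewelry and Watches"))
print(is_relevant(line_2, "Jewelry and Watches"))
```

On the two example lines from the question this marks the first as relevant and the second as not relevant, but how well it holds up depends entirely on how good the word library is.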

 

Looking at stuff in Alteryx

The Intelligence Suite add-on for Alteryx can do unsupervised topic modelling, but given that you already know which descriptions are relevant and which are irrelevant, I don't think this will really help here. What Intelligence Suite would allow you to do is read in all the descriptions and, based on the LDA model, score each description against a particular topic, so it might identify topics that talk more about how you would wear an item, versus trends, versus materials and manufacturing.

 

Not native Alteryx, but can be coded in

Stuff like named entity recognition is not included in Alteryx as a tool, but within Alteryx you could leverage either services like Microsoft Cognitive Services or Python packages such as spaCy.

spaCy has pre-trained models which you can use, for example ones built from labelled datasets based on Wikipedia articles, and there is also the ability to train your own model, but I've not explored this.

 

While these are not specific answers, hopefully this gives you some additional insight into text analytics and areas to explore.

 

Chris