Alteryx Designer Desktop Discussions

Text Mining / NLP / classification - identifying "fund dropper" social care customers

Hi all,

 

This is my first Alteryx community post, so apologies if there's already a thread that has something similar.

 

I'm working with Adult Social Care records, and want to identify whether a residential customer has become a "fund dropper" - essentially, their personal funding for a residential placement has run out. This is only captured in an open comment field, and while the term itself is sometimes used, the wording varies widely, including but not limited to phrases like: "savings drop", "reaches the threshold", "funds have dropped", "approached the authority for support with funding" etc.

 

My data is a person ID, and an open comment field.

 

I've had a look at the Text Mining tools available, and am not sure which one(s) would help derive either a % likelihood they are a "fund-dropper", or a binary prediction yes/no.

 

Any help or advice would be greatly appreciated!

 

Alistair

FinnCharlton
13 - Pulsar

Hi @alistairmendeshay, I believe the tool you are looking for is called 'Text Classification'. This tool builds a text classification model based on training data provided to the tool. This means you will need to manually classify a good fraction of your data to be able to train the model - the more data you use for training, the better the model. Have a read about it here: https://help.alteryx.com/current/en/designer/tools/alteryx-intelligence-suite/text-mining/text-class...

 

There is also a tool called 'Zero-shot Text Classification' which requires no training data, but as a result this will likely be far worse at correctly classifying your text.

 

I would also have a think about whether this approach is likely to work for you, as there are a couple of things which will limit the model's ability to classify the text:

 

  1. Is the open comment field focussed (e.g. only talking about funds), or does it include lots of irrelevant information? The more irrelevant information, the worse the model will be. You might be able to mitigate this by finding a way to focus the text, trimming away the parts that are not talking about money.
  2. How many records do you have available, and how many can you reasonably classify by hand for training the model? Not training the model on enough records will limit its predictive power.
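
On point 1, one rough way to "focus" a comment is to keep only the sentences that mention funding-related words before passing the text to a classifier. Here's a minimal sketch in Python of that idea. The keyword list and the `funding_sentences` name are made up for illustration and would need extending with the phrasing that actually appears in your comments; the sentence splitting is deliberately naive.

```python
import re

# Hypothetical keyword patterns; extend these with phrases seen in your own data.
FUNDING_PATTERNS = [
    r"\bfund(s|ing|ed)?\b",
    r"\bsavings?\b",
    r"\bthreshold\b",
    r"\bself[- ]fund(er|ing)?\b",
]
FUNDING_RE = re.compile("|".join(FUNDING_PATTERNS), re.IGNORECASE)


def funding_sentences(comment):
    """Return only the sentences of a comment that mention a funding term."""
    # Naive sentence split: break after ., ! or ? when followed by whitespace.
    sentences = re.split(r"(?<=[.!?])\s+", comment.strip())
    kept = [s for s in sentences if FUNDING_RE.search(s)]
    return " ".join(kept)


# Made-up comment, purely for illustration.
example = (
    "Visited Mrs A on Tuesday. Family report her savings have dropped "
    "below the threshold. GP appointment booked for next week."
)
print(funding_sentences(example))
```

A comment with no funding terms comes back as an empty string, which is itself a useful signal: those records could be set aside as "probably not a fund dropper" before any model is involved.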

Thanks Finn, I'll give that a go!

 

It's possible the open comments aren't fund focussed. Is there a tool that can identify the parts of a comment that are?

 

What is the minimum number of records that should be manually classified? Are we talking hundreds, or possibly thousands?

Hi again,

 

On the Text Classification tool, can you explain the difference between the 'Training Text' and the 'Validation Text'? What do each of these look like?

 

For info, I've classified a good chunk of data, essentially classifying customers in a binary fashion: fund-dropper or not. I'm just not sure where this pre-classified data fits in, and what I'm missing.

 

Many thanks for any advice!

 

Alistair

apathetichell
19 - Altair

Start with data where you already know the answer, i.e. you know whether each person is a "fund dropper". Split this data into two groups (usually 75%/25%). The training data is the larger set of known outcomes, and it is what the model learns from. Once the model has been built (trained), you validate it (i.e. test its accuracy) on the validation data. If you are comfortable that the model works within the margin of error you expect, you can then deploy it to new data.

 

1) Train

2) Test

3) Deploy
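
To make the 75%/25% split concrete, here's a small sketch in Python. It assumes each hand-classified record is a `(person_id, comment, label)` tuple, and it splits within each label so both sets keep roughly the same share of fund droppers. The `stratified_split` name, the seed, and the sample records are all made up for illustration.

```python
import random
from collections import defaultdict


def stratified_split(records, train_fraction=0.75, seed=42):
    """Split (person_id, comment, label) records into training and validation
    sets, keeping roughly the same label balance in both."""
    by_label = defaultdict(list)
    for record in records:
        by_label[record[2]].append(record)

    rng = random.Random(seed)  # fixed seed so the split is repeatable
    train, validation = [], []
    for label_records in by_label.values():
        rng.shuffle(label_records)
        cut = int(len(label_records) * train_fraction)
        train.extend(label_records[:cut])
        validation.extend(label_records[cut:])
    return train, validation


# Made-up records: every fourth person is labelled a fund dropper.
records = [
    (i, f"comment {i}", "fund dropper" if i % 4 == 0 else "other")
    for i in range(48)
]
train, validation = stratified_split(records)
print(len(train), len(validation))
```

Splitting per label matters when one class is rare: a purely random 25% sample could end up with very few fund droppers, which would make the validation accuracy hard to trust.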

apathetichell
19 - Altair

Also - I'd look into logistic regression - it may be a better fit for your problem and may give you results faster.
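
For a feel of what logistic regression would give you here (a probability per comment, which maps onto the "% likelihood" asked for in the original post), below is a toy sketch in plain Python. It is not how any Alteryx tool is implemented: it assumes a simple bag-of-words representation with one weight per word, trained by stochastic gradient descent, and the function names, training sentences and settings are invented for illustration.

```python
import math
import re


def tokenize(text):
    """Lower-case the text and keep alphabetic words only."""
    return re.findall(r"[a-z]+", text.lower())


def predict_proba(text, weights, bias):
    """Estimated probability (0 to 1) that a comment describes a fund dropper."""
    z = bias + sum(weights.get(word, 0.0) for word in tokenize(text))
    z = max(-30.0, min(30.0, z))  # clamp so math.exp cannot overflow
    return 1.0 / (1.0 + math.exp(-z))


def train(texts, labels, epochs=200, learning_rate=0.5):
    """Fit one weight per word plus a bias using stochastic gradient descent.

    labels: 1 = fund dropper, 0 = not.
    """
    weights, bias = {}, 0.0
    for _ in range(epochs):
        for text, label in zip(texts, labels):
            error = predict_proba(text, weights, bias) - label
            # Each occurrence of a word nudges that word's weight.
            for word in tokenize(text):
                weights[word] = weights.get(word, 0.0) - learning_rate * error
            bias -= learning_rate * error
    return weights, bias


# Tiny made-up training set, far too small for real use.
texts = [
    "savings have dropped below the threshold",
    "family approached the authority for support with funding",
    "funds have dropped and placement is at risk",
    "annual review completed no change in needs",
    "customer settled well in the home",
    "gp visit arranged for next week",
]
labels = [1, 1, 1, 0, 0, 0]

weights, bias = train(texts, labels)
for comment in ["funds have dropped below the threshold",
                "annual review completed no change"]:
    print(comment, "->", round(predict_proba(comment, weights, bias), 3))
```

A yes/no prediction then comes from choosing a cut-off on the probability (0.5 is the usual starting point), and that cut-off can be moved depending on whether missed fund droppers or false alarms are more costly.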

Many thanks. So would I be right in thinking the 'Validation Text' is the remaining 25% in your example, where I already know what each record should be classified as, and the model effectively tests its decisions against mine?

apathetichell
19 - Altair

Yup. You should then build a confusion matrix: https://en.wikipedia.org/wiki/Confusion_matrix
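
As a quick illustration of what the confusion matrix holds, here's a sketch in Python that compares your own labels on the validation set with the model's predictions. The label strings, function name and example lists are made up; only the four counts and the standard accuracy, precision and recall formulas matter.

```python
def confusion_matrix(actual, predicted, positive="fund dropper"):
    """Count true/false positives and negatives for a binary classifier."""
    counts = {"tp": 0, "fp": 0, "fn": 0, "tn": 0}
    for a, p in zip(actual, predicted):
        if p == positive:
            counts["tp" if a == positive else "fp"] += 1
        else:
            counts["fn" if a == positive else "tn"] += 1
    return counts


def summarize(counts):
    """Accuracy, precision and recall, guarding against division by zero."""
    total = sum(counts.values())
    flagged = counts["tp"] + counts["fp"]        # everything the model called positive
    truly_positive = counts["tp"] + counts["fn"]  # everything you labelled positive
    return {
        "accuracy": (counts["tp"] + counts["tn"]) / total if total else 0.0,
        "precision": counts["tp"] / flagged if flagged else 0.0,
        "recall": counts["tp"] / truly_positive if truly_positive else 0.0,
    }


# Made-up validation results: your labels vs. the model's predictions.
P, N = "fund dropper", "other"
actual = [P, P, P, N, N, N, N, P]
predicted = [P, N, P, N, P, N, N, P]

counts = confusion_matrix(actual, predicted)
metrics = summarize(counts)
print(counts)
print(metrics)
```

Recall is probably the number to watch for this use case, since it tells you what share of the real fund droppers the model actually caught.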

Nour_Ama
5 - Atom

Hello, 

Did you find what you are looking for?
