Hi all,
This is my first Alteryx community post, so apologies if there's already a thread that has something similar.
I'm working with Adult Social Care records, and want to identify whether a residential customer has become a "fund dropper" - essentially, their personal funding for a residential placement has run out. This is only captured in an open comment field, and while the term itself is sometimes used, the wording varies. Examples include (but are not limited to): "savings drop", "reaches the threshold", "funds have dropped", "approached the authority for support with funding" etc.
My data is a person ID, and an open comment field.
I've had a look at the Text Mining tools available, and am not sure which one(s) would help derive either a % likelihood they are a "fund-dropper", or a binary prediction yes/no.
Any help or advice would be greatly appreciated!
Alistair
Hi @alistairmendeshay , I believe the tool you are looking for is called 'Text Classification'. This tool builds a text classification model based on training data provided to the tool. This means you will need to manually classify a good fraction of your data to be able to train the model - the more data you use for training, the better the model. Have a read about it here: https://help.alteryx.com/current/en/designer/tools/alteryx-intelligence-suite/text-mining/text-class...
There is also a tool called 'Zero-shot Text Classification' which requires no training data, but as a result this will likely be far worse at correctly classifying your text.
I would also have a think about whether this approach is likely to work for you, as there are a couple of things which will limit the model's ability to classify the text:
Thanks Finn, I'll give that a go!
The open comments may not be fund focussed throughout. Is it possible to use a tool to identify the parts of a comment that are?
What is the minimum number of records that should be manually classified? Are we talking hundreds or possibly thousands?
Hi again,
On the Text Classification tool, can you explain the difference between the 'Training Text' and the 'Validation Text'? What do each of these look like?
For info, I've classified a good chunk of data, essentially classifying customers in a binary fashion: fund-dropper or not. I'm just not sure where this pre-classified data fits in, and what I'm missing.
Many thanks for any advice!
Alistair
Start with data where you already know the outcome - i.e. you know whether each person is a "fund dropper". Split this data into two groups (usually 75%/25%). The training data is the known-outcomes set you use to train your model. Once your model has been built (trained), you validate it (i.e. test its accuracy) on the validation data. If you are comfortable that your model works within the margin of error you expect, you can then deploy it on new data.
1) Train
2) Test
3) Deploy
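If it helps to see the split outside of Alteryx, here is a minimal Python sketch of the 75%/25% idea. The function name `split_labelled` and the sample rows are made up for illustration, and this is not what the Text Classification tool does internally - it only shows how pre-classified records could be divided into a training set and a validation set.

```python
import random

def split_labelled(records, train_fraction=0.75, seed=42):
    """Shuffle labelled records and split them into training and validation sets."""
    shuffled = list(records)  # copy so the caller's list is left untouched
    random.Random(seed).shuffle(shuffled)  # fixed seed keeps the split repeatable
    cut = int(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]

# Hypothetical rows: (person_id, comment, is_fund_dropper)
labelled = [
    (1, "funds have dropped below the threshold", 1),
    (2, "family visited, no concerns raised", 0),
    (3, "approached the authority for support with funding", 1),
    (4, "review of care plan completed", 0),
    (5, "savings drop expected next month", 1),
    (6, "change of GP recorded", 0),
    (7, "reaches the threshold in March", 1),
    (8, "annual review booked", 0),
]

train, validation = split_labelled(labelled)
print(len(train), len(validation))
```

Both sets keep their labels: the training rows are what the model learns from, and the validation rows are held back so the model's predictions can be compared against classifications you already trust.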
Also - I'd look into logistic regression - this may be a better fit for you and may give faster results.
Many thanks. So would I be right in thinking the 'Validation Text' is the remaining 25% in your example? I already know what each record should be classified as, and the model effectively tests its decision against mine?
Yup. You should then build a confusion matrix - https://en.wikipedia.org/wiki/Confusion_matrix
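As a rough illustration of what a confusion matrix holds for a yes/no "fund dropper" prediction, here is a small Python sketch. The label lists are invented and `confusion_counts` is a hypothetical helper; in practice the actual values would be your manual classifications for the validation records and the predicted values would come from the model.

```python
def confusion_counts(actual, predicted):
    """Count true/false positives and negatives for binary labels (1 = fund dropper)."""
    counts = {"tp": 0, "fp": 0, "fn": 0, "tn": 0}
    for a, p in zip(actual, predicted):
        if a == 1 and p == 1:
            counts["tp"] += 1  # model said fund dropper, and they are
        elif a == 0 and p == 1:
            counts["fp"] += 1  # model said fund dropper, but they are not
        elif a == 1 and p == 0:
            counts["fn"] += 1  # model missed a real fund dropper
        else:
            counts["tn"] += 1  # model correctly said not a fund dropper
    return counts

# Invented validation labels: manual classification vs model prediction
actual = [1, 0, 1, 1, 0, 0, 1, 0]
predicted = [1, 0, 0, 1, 0, 1, 1, 0]

counts = confusion_counts(actual, predicted)
precision = counts["tp"] / (counts["tp"] + counts["fp"])
recall = counts["tp"] / (counts["tp"] + counts["fn"])
print(counts, precision, recall)
```

Precision tells you how many of the predicted fund droppers really are, and recall tells you how many of the real fund droppers the model found. Which one matters more depends on whether a missed case or a false alarm is more costly for you.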
Hello,
Did you find what you are looking for?