Hello,
I am trying to perform text extraction on unstructured data purely by machine learning (I recently purchased the Intelligence Suite). My dataset looks something like the table below. Each customer writes unstructured text about two topics, topic 1 and topic 2. The topics are the same for every customer, but the way they are written is highly variable, with keywords that often appear in both topics. Instead of spending days trying to figure out NLP rules, I'm looking to take a different approach by simply training a model, similar to how MonkeyLearn does it: youtube.com/watch?v=5xhvJls8b78&list=PL4yw9SBwClHQSzMHZEX4zhMvwiAKFtiE3&index=3 to tell me where the delimiter between the two topics is. What is the best way to do this in Alteryx? I've struggled to get this working with the classification tool.
Customer | Unstructured Text | Desired output 1 | Desired output 2 |
1 | Bunch of text about topic 1 … bunch of text about topic 2 | Bunch of text about topic 1 | bunch of text about topic 2 |
2 | Bunch of text about topic 1 … bunch of text about topic 2 | Bunch of text about topic 1 | bunch of text about topic 2 |
3 | Bunch of text about topic 1 … bunch of text about topic 2 | Bunch of text about topic 1 | bunch of text about topic 2 |
4 | Bunch of text about topic 1 … bunch of text about topic 2 | Bunch of text about topic 1 | bunch of text about topic 2 |
5 | Bunch of text about topic 1 … bunch of text about topic 2 | Bunch of text about topic 1 | bunch of text about topic 2 |
Edit: What I'm really looking to see is whether Alteryx can perform NER (named-entity recognition) using a training set.
How long are the values in your unstructured text field? And are they always written as topic 1 followed by topic 2? Could you split the text into chunks based on sentences? From 2021.2 onwards, the topic modelling tool allows you to score text against a previous topic model. If you score each sentence, you should see the scores for topic 1 decreasing and the scores for topic 2 increasing, and you can use the crossover point as your delimiter.
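To make the crossover idea concrete, here is a rough Python sketch of the logic (it could sit in a Python tool, but nothing here is Alteryx-specific). It assumes you already have one topic 1 score and one topic 2 score per sentence, for example exported from the topic modelling tool; the function names and the naive sentence splitter are my own placeholders, not anything from Alteryx. Rather than taking the first sentence where topic 2 beats topic 1, it picks the cut that gives the highest total of topic 1 scores before the cut plus topic 2 scores after it, which should be a bit less sensitive to one noisy sentence.

```python
import re


def split_sentences(text):
    # Naive splitter (an assumption): break on whitespace that follows ., ! or ?
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def best_cut(topic1_scores, topic2_scores):
    # Return the index of the first sentence that belongs to topic 2.
    # The cut maximises: sum of topic 1 scores before it + sum of topic 2 scores after it.
    if len(topic1_scores) != len(topic2_scores):
        raise ValueError("need one score per sentence for each topic")
    best_index, best_total = 0, None
    for cut in range(len(topic1_scores) + 1):
        total = sum(topic1_scores[:cut]) + sum(topic2_scores[cut:])
        if best_total is None or total > best_total:
            best_index, best_total = cut, total
    return best_index


def split_on_topics(sentences, topic1_scores, topic2_scores):
    # Rebuild the two desired output fields from the sentence list.
    if len(sentences) != len(topic1_scores):
        raise ValueError("need one score per sentence")
    cut = best_cut(topic1_scores, topic2_scores)
    return " ".join(sentences[:cut]), " ".join(sentences[cut:])
```

You would call `split_on_topics(split_sentences(text), scores_topic1, scores_topic2)` once per customer row. If the text is not always topic 1 followed by topic 2, this single-cut approach would not be enough.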
NER isn’t in the Intelligence Suite yet, but I am hoping it is added soon. There are Python packages such as spaCy which you can train, and this might make it easier to identify ‘topics’ if you know, as an example, that the first one talks about people and the second one talks about organisations. The cloud providers also offer services for this. An example is this post, which uses Azure: https://community.alteryx.com/t5/Alteryx-Designer-Discussions/Alteryx-Text-Analytics-Entities-Extrac...
Named Entity Recognition is in beta testing now. 🙂
Great news on NER! Any update on this?
NER has been added to the 2021.4 release and it has a visual output labelling each identified entity 👍