
Treyson · ‎08-25-2020


Featured on the Data Science Portal.

When I was asked to review the text mining tools, I jumped at the opportunity. I hadn’t used them yet, and in all the time I have been using Alteryx, this is the first completely new tool category to be introduced. I spent all my free time over the past few days playing around with the tools and Googling, and ultimately ended up lost.

Ya see, I would not classify myself as a data scientist (or a citizen data scientist). I am very heavy into databases and structure and processes and data flow. I can think analytically, I can look at data and tell you what I see, but the problem that text mining solves is not really in my wheelhouse. Then I sat down with Erica Reuter, an Alteryx Solutions Engineer, who is everything I am not as an Alteryx user. She really helped clarify what is going on, and the world makes a lot more sense. So here is the good and the bad of what I found.

The Good

Out of the gate, I want to tell you that I have never seen a set of tools with so little interface to set up. Each one is designed with a specific function. You really just have to point to what you want to process, and you are good to go. Everyone is reading this for the highlight reel, so let’s just jump right in.

Text Pre-Processing

This tool is super cool. It allows you to strip out stop words (and, is, but, etc.), so that you aren’t seeing these things in your analysis. You don’t care that AND is a giant word in your cloud, so why keep it in there? The tool has a nifty function that allows you to add your own stop words, so if your company name shows up a lot in survey data and you don’t care that it shows up a lot, get it out of there.
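To make the stop-word idea concrete, here is a minimal sketch in plain Python. It is not how the Alteryx tool is implemented; the tiny default list, the function name `remove_stop_words`, and the company name "Acme" are all made up for illustration.

```python
import re

# A tiny default list for illustration only; real tools ship a much longer one.
DEFAULT_STOP_WORDS = {"and", "is", "but", "the", "a", "to", "of"}


def remove_stop_words(text, extra_stop_words=()):
    """Lowercase the text, split it into words, and drop the stop words."""
    stop_words = DEFAULT_STOP_WORDS | {w.lower() for w in extra_stop_words}
    words = re.findall(r"[a-z0-9']+", text.lower())
    return [w for w in words if w not in stop_words]


# Hypothetical survey comment; "Acme" stands in for your company name.
comment = "Acme is great and the Acme support team is fast"

print(remove_stop_words(comment))
print(remove_stop_words(comment, extra_stop_words=["Acme"]))
```

The second call shows the custom stop word at work: the company name drops out and only the words you care about are left.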

Topic Modeling

Once someone held my hand through pretty much this entire tool, the results were cool. This tool allows you to break your word count down into topics and analyze how relevant each word is to one topic compared to the others. You plug in your own topic count, which is something you have to play around with to understand how much the topics overlap and where. Do we want to decrease the number of topics, and so on: these are the things an analyst would know that I don’t. You can then go on to use this data to pinpoint what you might need to focus on in certain areas… like if your Midwestern shipping division has a lot of feedback around on-time service, you might have a problem.

Sentiment Analysis

Something about Darth Vader… I dunno. There are four outputs from this tool: negative, neutral, and positive sentiment scores, plus a compound score. The negative, neutral, and positive scores add up to one, so each column is the share of the text that reads as that sentiment. The compound score assigns an overall sentiment between -1 (ALL CAPS ANGRY) and 1 (HAPPINESS!!!!!!!). The results themselves are valuable, but the real value comes when you sift through the results and adjust the range of scores that you treat as negative. You can then filter on those ranges downstream in order to take action on certain sentiments.
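Here is a rough sketch of what filtering on score ranges might look like once the scores are out of the tool. The rows and their scores are invented for illustration, and the function name `label_sentiment` is hypothetical. The default cutoffs of -0.05 and 0.05 are the conventional starting points suggested for VADER compound scores; they are meant to be adjusted after you look at your own data.

```python
def label_sentiment(compound, negative_cutoff=-0.05, positive_cutoff=0.05):
    """Bucket a compound score (-1 to 1) into a label using adjustable cutoffs."""
    if compound <= negative_cutoff:
        return "negative"
    if compound >= positive_cutoff:
        return "positive"
    return "neutral"


# Made-up scores for illustration; these are not real tool output.
rows = [
    {"id": 1, "negative": 0.5, "neutral": 0.5, "positive": 0.0, "compound": -0.75},
    {"id": 2, "negative": 0.0, "neutral": 1.0, "positive": 0.0, "compound": 0.0},
    {"id": 3, "negative": 0.0, "neutral": 0.25, "positive": 0.75, "compound": 0.875},
    {"id": 4, "negative": 0.25, "neutral": 0.75, "positive": 0.0, "compound": -0.125},
]

# The three proportion columns should add up to one for every row.
for row in rows:
    total = row["negative"] + row["neutral"] + row["positive"]
    assert abs(total - 1.0) < 1e-9

# Default cutoffs flag rows 1 and 4 as negative.
default_negative = [r["id"] for r in rows if label_sentiment(r["compound"]) == "negative"]

# A stricter cutoff keeps only the clearly angry responses for follow-up.
strict_negative = [
    r["id"]
    for r in rows
    if label_sentiment(r["compound"], negative_cutoff=-0.5) == "negative"
]

print(default_negative, strict_negative)
```

Moving `negative_cutoff` is the "adjust the range" step: the same scores produce a shorter or longer list of responses to act on.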

PDF Reader (3 tools)

This one I actually figured out, but it took me a minute. You can build this out with either two or three tools, depending on what you want to do. The PDF Input allows you to select a folder, so you are inputting however many PDFs exist in that folder, and you must be cautious of that.

The results don’t provide any value on their own, so they must be passed into the Image to Text tool, which pulls out the text within the document. It reads digital text (like a Word document saved as a PDF) very well. Image text and handwritten text are not without their problems, but they can be done.

There is a third tool that can be added, called the Image Template. Here is where the possibilities get wild. On this tool you can select pixel ranges and name them. Those names will be the headers of your output. This, when fed into the “T input” of your Image to Text tool, creates a template for how it is going to process the documents coming from your PDF Input.

The example that I want to see built out is someone who has a ton of form documents that were all rendered to be formatted EXACTLY the same. A great example would be W-2s (or other tax docs). You could format the first one to say this section is box 1, this section is box 2, etc., and then mine that from all the documents in that folder. You don’t have to sit there and manually enter all of them. The key to this is the formatting. Since the tool is pixel based, the information has to be in the same spot on every document.

These tools, I would say, are weird in their design. They seem to have been created with a specific use case in mind: rather than grabbing one PDF and just reading it, we must put it in a location with no other PDFs inside. If multiple PDFs exist in a folder, the tool will process all of them, which, depending on the number of files and the pages within those files, can take a long time. That doesn’t make sense on the first pass, but when you think about massive intakes of PDFs, it starts to make sense.
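To show why the position-based template matters, here is a toy sketch. It assumes a "page" is just a grid of characters rather than real pixels, and it does no OCR at all; the names `template`, `extract_fields`, `box_1`, and `box_2` are made up and are not the Alteryx tools' actual interface. The point is only that named regions work when every document puts the data in the same spot, and fall apart when one does not.

```python
def extract_fields(page, template):
    """Cut each named region out of the page and return the text found there.

    page     -- list of equal-width strings, one per row (a stand-in for pixels)
    template -- {field_name: (left, top, right, bottom)} with exclusive right/bottom
    """
    fields = {}
    for name, (left, top, right, bottom) in template.items():
        rows = [row[left:right].strip() for row in page[top:bottom]]
        fields[name] = " ".join(r for r in rows if r)
    return fields


# Hypothetical template drawn once, on the first form.
template = {
    "box_1": (0, 1, 10, 2),   # columns 0-9 of row 1
    "box_2": (10, 1, 20, 2),  # columns 10-19 of row 1
}

# A form laid out exactly like the template expects.
aligned_page = [
    "Wages".ljust(10) + "Tax".ljust(10),
    "50000".ljust(10) + "7500".ljust(10),
]

# The same kind of form, but rendered six columns to the right.
shifted_page = [
    " " * 6 + "Wages".ljust(10) + "Tax",
    " " * 6 + "61000" + " " * 5 + "9100",
]

print(extract_fields(aligned_page, template))
print(extract_fields(shifted_page, template))
```

The aligned page comes back with clean values under each header. The shifted page comes back with a wage value cut short and its last digit spilled into the next box, which is exactly the kind of mess you get when the forms are not formatted identically.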

The Bad

Going back to the fact that I don’t know what I am doing… I did some bad things. Things I would not have known about unless guided.

Word Cloud

I want someone to give a good reason that word clouds exist. @EricaR mentions that she uses it to quickly find stop words that haven’t been removed with the Text Pre-Processing tool, which is fair, and I will use it that way. You can use an image to create the shape of your cloud, so that’s cool.

Text Pre-Processing and Sentiment Analysis

You might remember that I said that those tools were good and they are, but together they are bad. The Text Pre-Processing tool can strip out a lot of how the sentiment analysis (VADER) works. Since I had no idea of that, I was removing much of the sentiment from what we were analyzing, which would give bad results. One thing I noticed is that there were some surveys that used sarcasm or very calm text to tell an angry story, and those didn’t get picked up as negative. That wasn’t the fault of the tool or the model, but a result of what I was doing. Here is a little bit on this from @EricaR :

Why wouldn’t I pre-process my data before running it through sentiment analysis?

The Sentiment Analysis feature found in Alteryx’s Intelligence Suite uses an algorithm called VADER (Valence Aware Dictionary and sEntiment Reasoner). VADER is a rule-based algorithm, and you will soon see how the rules it employs do not mesh well with data cleansing or text pre-processing. In a nutshell, VADER runs through the following steps:

1 – Check the words presented against the built-in lexicon. VADER looks for words it knows like “horrible” and realizes that is coded as negative.

2 – Evaluate punctuation. If punctuation is present, it can change the sentiment score. For example, “that’s horrible!!!” will score as more negative than “that’s horrible.” When we use a data cleansing tool or pre-processing tool, the punctuation may be stripped out, negating this step.

3 – Evaluate capitalization. Again, this is treated as emphasis. “THAT’S HORRIBLE!” will be perceived as more negative than “that’s horrible!” Are you seeing a pattern here? Data cleansing may change all of the data to the same case.

4 – Examine modifiers. Sometimes we speak of something and modify the meaning with a word before or after the one being examined. This can change the sentiment. Clearly “that’s horrible” and “that’s not horrible” have different sentiments. If we removed stop words with the pre-processing tool, there is a good chance these modifiers would be stripped out and missed.

5 – Consider shifts in polarity. The English language is complex. We often have sentences with both positive and negative sentiment expressed. This is called a shift in polarity. Think of the classic practice of softening criticism: “Hey I think you’re really wonderful, but it bothers me when you are late.” This sentence starts positive and ends negative. VADER looks at that to determine the true intent. Hint… it’s almost always the latter part of the sentence.
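The steps above explain why cleaning first hurts. As a rough sketch, the plain-Python snippet below does not run VADER at all; it simply counts the cues those rules depend on (exclamation marks, ALL CAPS words, "not", "but") before and after a typical cleanup. The stop-word list and the function names `preprocess` and `surviving_cues` are assumptions made for illustration, not the Text Pre-Processing tool's actual behavior.

```python
import re

# Hypothetical stop-word list; many real lists include "not" and "but" too.
STOP_WORDS = {"that's", "not", "but", "is", "it", "the", "and", "i", "you"}


def preprocess(text):
    """Typical cleanup: lowercase, strip punctuation, drop stop words."""
    lowered = text.lower()
    no_punctuation = re.sub(r"[^a-z0-9'\s]", "", lowered)
    return " ".join(w for w in no_punctuation.split() if w not in STOP_WORDS)


def surviving_cues(text):
    """Count the surface cues that rule-based scoring like VADER leans on."""
    words = re.findall(r"[A-Za-z']+", text)
    return {
        "exclamation_marks": text.count("!"),
        "all_caps_words": sum(1 for w in words if len(w) > 1 and w.isupper()),
        "negations": sum(1 for w in words if w.lower() == "not"),
        "contrastive_but": sum(1 for w in words if w.lower() == "but"),
    }


raw = "THAT'S NOT HORRIBLE!!!"
cleaned = preprocess(raw)

print(repr(cleaned))
print(surviving_cues(raw))
print(surviving_cues(cleaned))
```

After cleanup the only thing left of the example is the word "horrible", so a lexicon lookup would read a sentence that originally said "not horrible" as plainly negative, with none of the emphasis or negation left to correct it.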

In short, I think that the developers over at Alteryx have done a good thing. They have created a series of tools that work and function the way I expect Alteryx to function. If I were a citizen data scientist and I wanted a quick and dirty way to do this work, I would be stoked at how simple it is to set up. But since I am not that, this is probably a toolset that I will hold off on really using. Since this is the first iteration of the toolset, there is a lot of room for features to be added that help guide us as users through this analytical process. Making the toolset intuitive is incredibly valuable, and I believe that is what has made Alteryx successful. I have avoided the predictive tools for a long time for this same reason. Fortunately, we now have the Machine Learning toolset, which attempts to solve this exact problem, and if I ever get invited back to write for Alteryx, my next article will dive right in.