
SydneyF

If you read my Word2Vec article from a couple of months ago, you may have deduced that I’ve been dabbling in the wild world of Natural Language Processing in Python. One of the NLP models I’ve trained using the Community corpus is a bigram Phrase (collocation) detection model, built with the Gensim Python library. Phrase detection models are neat because they find common phrases in your text (e.g., “Alteryx Server” or “Formula tool”) based on how frequently a pair of words occurs together relative to how frequently each word occurs independently in the text. Often, a phrase has a different meaning than the individual words it is made up of.

 

After training a Phrase model, you can apply it to a set of words (i.e., a document) to convert identified phrases into single tokens by adding an underscore between the two words. This means when you tokenize your document, the phrases will become their own entities, which is handy when you are doing processes like word count, word matching, or really any type of text mining. 
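
 

As a quick illustration, here is a minimal sketch of training and applying a Phrases model with Gensim (the toy corpus and parameter values here are mine, not the Community corpus):

 

import gensim

# Toy corpus: each document is a list of lowercase tokens
sentences = [["install", "the", "tool", "on", "alteryx", "server"],
             ["alteryx", "server", "schedules", "the", "workflow"],
             ["the", "formula", "tool", "updates", "a", "column"]]

# Low min_count/threshold so the toy example actually detects a phrase
bigram = gensim.models.Phrases(sentences, min_count=1, threshold=1)

print(bigram[["publish", "to", "alteryx", "server"]])
# e.g., ['publish', 'to', 'alteryx_server']

bigram.save('bigram.model')  # the file we will ship with the tool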

 

The default equation used to determine bigrams in the Gensim Phrases() function is the same one Mikolov et al. proposed in their paper Distributed Representations of Words and Phrases and their Compositionality.

 

score(wa, wb) = (count(wa wb) − δ) / (count(wa) × count(wb))

where δ is a discounting coefficient that prevents phrases made up of very infrequent words from scoring highly. Word pairs whose score exceeds a set threshold are treated as phrases.
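
 

In code, the scoring boils down to something like this (a sketch of the formula above, not Gensim's exact implementation):

 

# Sketch of the bigram scoring formula; counts come from a corpus scan
def bigram_score(count_ab, count_a, count_b, delta=5.0):
    """Higher scores mean (a, b) is more phrase-like; delta discounts rare words."""
    return (count_ab - delta) / (count_a * count_b)

# A candidate pair is accepted as a phrase when its score clears a threshold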

 

After training a Phrases model with Community texts, I wanted to be able to incorporate the model into Alteryx workflows that I was using to process text, and hopefully even be able to share the model with other Alteryx users. After thinking through this, I realized it was a perfect application for the Python SDK.

 

Full disclosure: I was quickly enchanted by the Python SDK when it was first released. It feels very cool to be able to develop custom Alteryx tools with Python code, and it’s a creative, open way to extend the Alteryx platform. The following is a very simple example of how to use the SDK to deploy a trained model, and I am really excited to see what you all do to expand upon it. On that same note, if you need help getting started more generally with the Python SDK, please take a look at @NickJ's article Levelling Up: A Beginner’s Guide to the Python SDK in Alteryx, or @NeilR's Text Analysis in Alteryx with the Python SDK: Gender Classification. Both of these articles are priceless for getting started with the SDK. In this post, I will be skipping over a lot of important content explicitly covered in those two blogs, so don't hesitate to refer back to them if you get lost. As both Nick and Neil mention, the easiest way to work with the SDK is to modify a pre-existing SDK example tool. In this scenario, the Single Input Output tool is a fabulous starting point.

 

As you may or may not know, the Python SDK processes data a single record (row) at a time: it brings in a row of data from an Alteryx data stream, performs a process on it, and outputs a single row, then repeats the same process on the next row. Although this method may not be ideal for training a model, where you (typically) need access to all of your rows of data at once, it is well suited to creating something like a Score tool.

 

The Alteryx Score tool is clever because it is somewhat model-agnostic: it identifies the type of model being fed into the O anchor and uses the appropriate corresponding code to create predictions with it. This is not what we will be building today, but it could be something you leverage this article as a starting point for. What we will be building is effectively a text pre-processing tool that strips punctuation, converts all letters to lowercase, and joins detected bigrams into single tokens.

 

The great trick in all of this is shipping our pre-trained Phrases model with the SDK tool so that when it is shared and installed on other computers, it is able to run successfully. This means placing a copy of the saved model file into the tool's folder, and including it in the folder when creating the .yxi, which will have a folder structure like this (after renaming the Example Tool components):

 

  • Config.xml
  • Phraser
    • PhraserConfig.xml
    • Phraser_Engine.py
    • Phraser_Gui.html
    • Phraser_Icon.png
    • requirements.txt
    • bigram.model
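
 

A .yxi file is just a zip archive of this folder structure with the extension renamed, so one way to package the tool (a sketch; the paths here are placeholders) is:

 

import os
import shutil

# Zip the folder containing Config.xml and the Phraser subfolder,
# then swap the .zip extension for .yxi
shutil.make_archive('Phraser', 'zip', root_dir='path/to/tool_folder')
os.rename('Phraser.zip', 'Phraser.yxi')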

 

The interface for this tool is really generic: a single drop-down widget used for field selection. The HTML file for our tool (used to specify the configuration options) looks like this:

 

<!DOCTYPE html>
<html style="padding:20px">
<head>
  <meta charset="utf-8">
  <title>Word to Vector</title>
  <script type="text/javascript">
    document.write('<link rel="import" href="' + window.Alteryx.LibDir + '2/lib/includes.html">');
  </script>
</head>
<body>
  <label>XMSG("Select Text Field to Convert")</label>
  <ayx data-ui-props="{type:'DropDown'}" data-item-props="{dataName:'FieldSelect', dataType:'FieldSelector'}"></ayx>
</body>
</html>

 

And results in a Configuration Window with a single drop-down field selector.

 

[screenshot: 2018-11-26_10-29-20.png]

 

The first step is for the Python script to load in the necessary packages. We add this at the start of the SDK script. 

 

from gensim.parsing.preprocessing import remove_stopwords
import gensim
import numpy as np

import re
import os

# Standard SDK imports, already present in the example tool's template,
# used by the code below
import AlteryxPythonSDK as Sdk
import xml.etree.ElementTree as Et

 

Next, we can modify the __init__ and pi_init methods of the SDK code to create an output field large enough for our texts, and to make sure our field selection (FieldSelect) is read in from the configuration.

 

    def __init__(self, n_tool_id: int, alteryx_engine: object, output_anchor_mgr: object):
        """
        Constructor is called whenever the Alteryx engine wants to instantiate an instance of this plugin.
        :param n_tool_id: The assigned unique identification for a tool instance.
        :param alteryx_engine: Provides an interface into the Alteryx engine.
        :param output_anchor_mgr: A helper that wraps the outgoing connections for a plugin.
        """

        # Default properties
        self.n_tool_id = n_tool_id
        self.alteryx_engine = alteryx_engine
        self.output_anchor_mgr = output_anchor_mgr

        self.output = "phrases"
        self.output_type = Sdk.FieldType.wstring
        self.output_size = 10000

    def pi_init(self, str_xml: str):
        """
        Reads the user's field selection from the tool configuration.
        Called when the Alteryx engine is ready to provide the tool configuration from the GUI.
        :param str_xml: The raw XML from the GUI.
        """

        root = Et.fromstring(str_xml)
        if root.find('FieldSelect') is not None:
            self.field_selection = root.find('FieldSelect').text
        else:
            self.alteryx_engine.output_message(self.n_tool_id, Sdk.EngineMessageType.error,
                                               'Please select field to analyze')

        self.alteryx_engine.output_message(self.n_tool_id, Sdk.EngineMessageType.info, self.field_selection)

        self.output_anchor = self.output_anchor_mgr.get_output_anchor(
            'Output')  # Getting the output anchor from the XML file.

 

Then we skip to the ii_push_record method, which is responsible for sending records out. It is called each time an input record is sent to the tool, so in the context of row-by-row processing it runs once per input row; this is where the bulk of our processing code goes.

 

The bulk of our custom code is positioned after a conditional statement that makes sure the incoming record is not empty.

 

    def ii_push_record(self, in_record: object) -> bool:
        """
        Responsible for pushing records out
        Called when an input record is being sent to the plugin.
        :param in_record: The data for the incoming record.
        :return: False if method calling limit (record_cnt) is hit.
        """
        # Copy the data from the incoming record into the outgoing record.
        self.record_creator.reset()
        self.record_copier.copy(self.record_creator, in_record)

        if self.parent.input_field.get_as_string(in_record) is not None:

 

First, we need to define our file directory as a variable, which we can do with the os library. This is what makes it possible to dynamically find the location of the bigram model file on different workstations. We then use our dirname variable to load our Phrases model.

 

dirname = os.path.dirname(__file__)

bigram = gensim.models.Phrases.load(os.path.join(dirname, 'bigram.model'))
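
 

One caveat worth flagging: because this code lives in ii_push_record, the model is re-read from disk for every incoming row. A variant (my own suggestion, not part of the original tool) would cache the model on first use:

 

# Sketch: load the model once and reuse it across records
if not hasattr(self.parent, 'bigram'):
    dirname = os.path.dirname(__file__)
    self.parent.bigram = gensim.models.Phrases.load(
        os.path.join(dirname, 'bigram.model'))
bigram = self.parent.bigram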

 

Once our model is loaded and ready to go, we can load in a row of input data.

 

# load in text data as string
line = self.parent.input_field.get_as_string(in_record)

 

The Phrases model expects a list of words as input. We can convert our string input to this format by split()-ing on whitespace characters in a list comprehension. At the same time, we can make all of our characters lowercase and strategically strip most of the punctuation with a series of fancy (or poorly written; it's hard to know sometimes) regex statements.

 

# parse and strip punctuation: lowercase, drop punctuation, swap hyphens
# for underscores, collapse whitespace, and split into a list of words
words = [x for x in re.sub(r'\s_\s', " ", re.sub(r'\.', " ", re.sub(r'\s+', " ", re.sub(r'-', "_",
re.sub(r'[\"\'\|,:@#?!$%&()*+=><^~`{};/]', "", line.lower()))))).split() if x]
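
 

To see what this does end to end, here is the same cleaning logic wrapped in a standalone function you can experiment with outside the SDK (the sample string is mine):

 

import re

def clean(line):
    # lowercase, drop punctuation, swap hyphens for underscores,
    # collapse whitespace, then split into a list of words
    line = re.sub(r'[\"\'\|,:@#?!$%&()*+=><^~`{};/]', "", line.lower())
    line = re.sub(r'-', "_", line)
    line = re.sub(r'\s+', " ", line)
    line = re.sub(r'\.', " ", line)
    line = re.sub(r'\s_\s', " ", line)
    return [x for x in line.split() if x]

print(clean("Publish your workflow to Alteryx Server!"))
# ['publish', 'your', 'workflow', 'to', 'alteryx', 'server']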

 

Now that we have a cleaned up list of words, we can run it through our bigram phrases model. 

 

# convert list of strings to include bigrams
bigrams = bigram[words]

 

To be able to write this data out as a single field, we need to join our new list of words back together.

 

output = ' '.join(bigrams)

 

And then we can write them out, returning to the generic SDK code. The rest of the SDK code should be good to go!

 

self.parent.output.set_from_string(self.record_creator, output)
out_record = self.record_creator.finalize_record()

 

That's pretty much it! The tool reads the model in from its own directory using the os library, and because the model file is packaged inside the .yxi, it is read in exactly the same way on whatever machine the tool is installed on.

 

Here is the tool in action:

 

[screenshot: 2018-11-26_10-41-33.png]

 

Before Text Processing:

 

[screenshot: 2018-11-26_10-46-37.png]

 

After Text Processing:

 

[screenshot: 2018-11-26_10-47-00.png]

 

We can tokenize the processed text by splitting on spaces with a Text to Columns tool, and then do a simple word count with a Summarize tool.
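
 

For comparison, the same tokenize-and-count step in plain Python is a few lines with collections.Counter (the input string here is a stand-in for the tool's output field):

 

from collections import Counter

processed = "how to publish a workflow to alteryx_server"  # stand-in output
counts = Counter(processed.split())
print(counts.most_common(5))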

 

Filtering out the stop words (which you can do with an actual Filter tool, a Join tool with a stop words input, or by adding the functionality to your SDK tool; see the sketch below 🙂), we can see some of the top words and phrases in our Designer Knowledge Base:

[screenshot: 2018-11-26_11-02-30.png]
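
 

If you go the SDK route, the remove_stopwords function already imported at the top of the script is one option; it takes and returns a string, so it would slot in before the output is written. A quick sketch:

 

from gensim.parsing.preprocessing import remove_stopwords

print(remove_stopwords("how to publish a workflow to alteryx_server"))
# e.g., 'publish workflow alteryx_server'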

 

As I mentioned at the beginning, this is a really simple example of embedding a model in a Python SDK tool, but the same idea could be leveraged for many different applications. I am so excited to see what you all come up with!

 

If you would like to have a copy of the tool for your very own (to use or take apart, whatever works), you can download it here.

Sydney Firmin

A geographer by training and a data geek at heart, Sydney joined the Alteryx team as a Customer Support Engineer in 2017. She strongly believes that data and knowledge are most valuable when they can be clearly communicated and understood. She currently manages a team of data scientists that bring new innovations to the Alteryx Platform.

