If you read my Word2Vec article from a couple months ago, you may have deduced I’ve been dabbling with the wild world of Natural Language Processing in Python. One of the NLP models I’ve trained using the Community corpus is a bigram Phrase (collocation) detection model using the Gensim Python library. Phrase detection models are neat because they find common phrases in your text (e.g., “Alteryx Server” or “Formula tool”) based on how frequently a pair of words occurs together relative to how frequently each word occurs independently in the text. Phrases often mean something different from the combination of the individual words they are made up of.
After training a Phrase model, you can apply it to a set of words (i.e., a document) to convert identified phrases into single tokens by adding an underscore between the two words. This means when you tokenize your document, the phrases will become their own entities, which is handy when you are doing processes like word count, word matching, or really any type of text mining.
The default equation used to determine bigrams in the Gensim Phrases() function is the same one Mikolov et al. proposed in their paper Distributed Representations of Words and Phrases and their Compositionality.
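To make the scoring idea concrete, here is a minimal sketch of the formula from the Mikolov et al. paper, score(a, b) = (count(a b) - delta) / (count(a) * count(b)), computed from raw counts on a toy corpus. The function names, the toy sentences, and the delta value are all assumptions made for illustration. This is not Gensim's code; as far as I know, Gensim's default scorer also scales the result by the vocabulary size and uses its min_count setting as the discount.

```python
from collections import Counter


def count_tokens(sentences):
    """Count single words and adjacent word pairs in tokenized sentences."""
    unigrams = Counter()
    bigrams = Counter()
    for sentence in sentences:
        unigrams.update(sentence)
        bigrams.update(zip(sentence, sentence[1:]))
    return unigrams, bigrams


def phrase_score(a, b, unigrams, bigrams, delta=1.0):
    """Score a word pair: (count(a b) - delta) / (count(a) * count(b)).

    delta is a discount that keeps very rare pairs from scoring highly.
    """
    denominator = unigrams[a] * unigrams[b]
    if denominator == 0:
        return 0.0  # one of the words never occurs, so no score
    return (bigrams[(a, b)] - delta) / denominator


# Toy corpus (hypothetical, for illustration only)
toy_sentences = [
    ["alteryx", "server", "is", "fast"],
    ["alteryx", "server", "is", "great"],
    ["the", "tool", "is", "great"],
    ["the", "tool", "is", "fast"],
]
unigrams, bigrams = count_tokens(toy_sentences)

print(phrase_score("alteryx", "server", unigrams, bigrams))  # 0.25
print(phrase_score("is", "great", unigrams, bigrams))        # 0.125
```

Both pairs occur twice, but "is" also appears in many other contexts, so "is great" scores lower than "alteryx server". A pair is treated as a phrase when its score clears a chosen threshold.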
After training a Phrases model with Community texts, I wanted to be able to incorporate the model into Alteryx workflows that I was using to process text, and hopefully even be able to share the model with other Alteryx users. After thinking through this, I realized it was a perfect application for the Python SDK.
Full disclosure, I was quickly enchanted by the Python SDK when it was first released. It feels very cool to be able to develop custom Alteryx tools with Python code. It’s a creative and open way to extend the Alteryx platform. The following is a very simple example of how to use the SDK to deploy a trained model. I am really excited to see what you all do to expand upon it. On that same note, if you need help getting started more generally with the Python SDK, please take a look at @NickJ's article Levelling Up: A Beginner’s Guide to the Python SDK in Alteryx, or @NeilR's Text Analysis in Alteryx with the Python SDK: Gender Classification. Both of these articles are priceless for getting started with the SDK. In this post, I will be skipping over a lot of important content explicitly covered in these two blogs, so don't hesitate to refer back to them if you get lost. As both Nick and Neil mention, the easiest way to work with the SDK is to modify a pre-existing SDK example tool. In this scenario, the Single Input Output tool is a fabulous starting point.
As you may or may not know, the Python SDK processes data a single record (row) at a time. This means that the SDK will bring in a row of data from an Alteryx data stream, perform a process on it, and output a single row. The SDK will then perform the same process on the next row of data. Although this method may not be ideal for training a model, where you (typically) need access to all of your rows of data at once, it’s well suited to creating something like a Score tool.
The Alteryx Score tool is clever because it is somewhat model-agnostic. It is able to do this by identifying the type of model being fed into the O anchor and using the appropriate corresponding code to create predictions with it. This is not what we will be building today, but you could use this article as a starting point for something like it. What we will be building out today is effectively a text data pre-processing tool that strips punctuation, makes all letters lowercase, and creates bigrams.
The great trick in all of this is shipping our pre-trained Phrases model with the SDK tool so that when it is shared and installed on other computers, it is able to run successfully. This means placing a copy of the saved model file into the tool's folder, and including it in the folder when creating the .yxi, which will have a folder structure like this (after renaming the Example Tool components):
- Config.xml
- Phraser
  - PhraserConfig.xml
  - Phraser_Engine.py
  - Phraser_Gui.html
  - Phraser_Icon.png
  - requirements.txt
  - bigram.model
The interface for this tool is really generic - it is a simple drop-down widget used for field selection. The HTML file for our tool (used to specify the configuration options) looks like this:
<!DOCTYPE html>
<html style="padding:20px">
<head>
  <meta charset="utf-8">
  <title>Word to Vector</title>
  <script type="text/javascript">
    document.write('<link rel="import" href="' + window.Alteryx.LibDir + '2/lib/includes.html">');
  </script>
</head>
<body>
  <label>XMSG("Select Text Field to Convert")</label>
  <ayx data-ui-props="{type:'DropDown'}" data-item-props="{dataName:'FieldSelect', dataType:'FieldSelector'}"></ayx>
</body>
</html>
And results in a Configuration Window with a single drop-down field selector.
The first step is for the Python script to load in the necessary packages. We add this at the start of the SDK script.
from gensim.parsing.preprocessing import remove_stopwords
import gensim
import numpy as np
import re
import os
Next, we can modify the __init__ and pi_init sections of the SDK code to create an output field large enough for our texts, and make sure our field selection (FieldSelector) is getting read in for the input.
def __init__(self, n_tool_id: int, alteryx_engine: object, output_anchor_mgr: object):
    """
    Constructor is called whenever the Alteryx engine wants to instantiate an instance of this plugin.
    :param n_tool_id: The assigned unique identification for a tool instance.
    :param alteryx_engine: Provides an interface into the Alteryx engine.
    :param output_anchor_mgr: A helper that wraps the outgoing connections for a plugin.
    """
    # Default properties
    self.n_tool_id = n_tool_id
    self.alteryx_engine = alteryx_engine
    self.output_anchor_mgr = output_anchor_mgr
    self.output = "phrases"
    self.output_type = Sdk.FieldType.wstring
    self.output_size = 10000

def pi_init(self, str_xml: str):
    """
    Reads the selected text field from the user configuration.
    Called when the Alteryx engine is ready to provide the tool configuration from the GUI.
    :param str_xml: The raw XML from the GUI.
    """
    # Parse the configuration once and look up the field selector value.
    field_node = Et.fromstring(str_xml).find('FieldSelect')
    if field_node is not None:
        self.field_selection = field_node.text
        # Only report the selection when there is one to report.
        self.alteryx_engine.output_message(self.n_tool_id, Sdk.EngineMessageType.info, self.field_selection)
    else:
        self.alteryx_engine.output_message(self.n_tool_id, Sdk.EngineMessageType.error, 'Please select field to analyze')

    # Getting the output anchor from the XML file.
    self.output_anchor = self.output_anchor_mgr.get_output_anchor('Output')
Then, we skip to the ii_push_record section, which is called whenever an input record is sent to the tool and is responsible for sending records out. This is where the bulk of our processing code goes. In the context of row-by-row processing, this part of the SDK is called once for each input row of data, so it processes the rows one at a time.
The bulk of our custom code is positioned after a conditional statement that makes sure the incoming record is not empty.
def ii_push_record(self, in_record: object) -> bool:
    """
    Responsible for pushing records out
    Called when an input record is being sent to the plugin.
    :param in_record: The data for the incoming record.
    :return: False if method calling limit (record_cnt) is hit.
    """
    # Copy the data from the incoming record into the outgoing record.
    self.record_creator.reset()
    self.record_copier.copy(self.record_creator, in_record)

    if self.parent.input_field.get_as_string(in_record) is not None:
First, we need to define our file directory as a variable, which we are able to do with the os library. This is what makes it possible to dynamically find the location of the bigram model on different workstations. We then use our dirname variable to load our Phrases model.
dirname = os.path.dirname(__file__)
bigram = gensim.models.Phrases.load(os.path.join(dirname, 'bigram.model'))
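Because ii_push_record runs once per row, a load placed there is repeated for every record. A minimal sketch of one way to avoid that, assuming you are happy to cache the loaded object for the life of the process, is below. The helper names model_path and load_once are hypothetical and not part of the SDK or Gensim; the loader is passed in as an argument so the sketch runs without Gensim installed.

```python
import os
from functools import lru_cache

MODEL_FILENAME = "bigram.model"  # file name used in this article


def model_path(module_file, filename=MODEL_FILENAME):
    """Build the path to a file that sits in the same folder as module_file."""
    return os.path.join(os.path.dirname(os.path.abspath(module_file)), filename)


@lru_cache(maxsize=None)
def load_once(path, loader):
    """Call loader(path) the first time and reuse the result afterwards."""
    return loader(path)


# Stand-in loader for demonstration (hypothetical); it records each call.
load_calls = []


def fake_loader(path):
    load_calls.append(path)
    return {"loaded_from": path}


path = model_path(os.path.join("some_tool_folder", "Phraser_Engine.py"))
first = load_once(path, fake_loader)
second = load_once(path, fake_loader)
print(first is second, len(load_calls))  # True 1
```

Inside the tool, the equivalent call would be something like load_once(model_path(__file__), gensim.models.Phrases.load), so the model file is read on the first record only.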
Once our model is loaded and ready to go, we can load in a row of input data.
# load in text data as string
line = self.parent.input_field.get_as_string(in_record)
The Phrases model expects a list of words as input. We can convert our string input to this format by split()-ing on whitespace characters in a list comprehension. At the same time, we can make all of our characters lowercase and strategically strip a lot of the punctuation with a series of fancy (or poorly written, it's hard to know sometimes) regex statements.
# parse and strip punctuation
words = [x for x in re.sub(r'\s_\s', " ", re.sub(r'[\.]', " ", re.sub(r'[\s+]', " ", re.sub(r'[-]', "_",
    re.sub(r'[\"\'\|,:@#?!$"%&()*+=><^~`{};/@]', "", line.lower()))))).split() if x != " "]
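The nested calls run from the inside out, which can be hard to read. Here is the same sequence unrolled into one step per line as a standalone function, so you can try it outside of the SDK. The function name clean_and_tokenize and the sample sentence are made up for illustration.

```python
import re


def clean_and_tokenize(line):
    """Lowercase a string, strip punctuation, and split it into words."""
    text = line.lower()
    # 1. Drop most punctuation outright.
    text = re.sub(r'[\"\'\|,:@#?!$"%&()*+=><^~`{};/@]', "", text)
    # 2. Turn hyphens into underscores so hyphenated words stay together.
    text = re.sub(r'[-]', "_", text)
    # 3. Replace each whitespace character (tab, newline, ...) with a space.
    text = re.sub(r'[\s+]', " ", text)
    # 4. Replace periods with spaces.
    text = re.sub(r'[\.]', " ", text)
    # 5. Remove underscores that stand alone between two spaces.
    text = re.sub(r'\s_\s', " ", text)
    # split() with no arguments discards empty strings and extra spaces.
    return text.split()


print(clean_and_tokenize("Hello, World! The Formula-tool is great."))
# ['hello', 'world', 'the', 'formula_tool', 'is', 'great']
```

Note that split() never returns a bare space, so the extra filter in the list comprehension is not needed in this version.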
Now that we have a cleaned up list of words, we can run it through our bigram phrases model.
# convert list of strings to include bigrams
bigrams = bigram[words]
To be able to write this data out as a single field, we need to join our new list of words back together.
output = ' '.join(bigrams)
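If you want to see the effect of this step without a trained model on hand, here is a simplified stand-in. It joins word pairs found in a fixed, hand-written set, whereas a trained Phrases model decides which pairs to join from the scores it learned. The function merge_bigrams and the known_phrases set are hypothetical and only meant to illustrate the shape of the input and output.

```python
def merge_bigrams(words, phrases, delimiter="_"):
    """Join adjacent word pairs that appear in phrases into single tokens."""
    merged = []
    i = 0
    while i < len(words):
        pair = tuple(words[i:i + 2])
        if len(pair) == 2 and pair in phrases:
            merged.append(delimiter.join(pair))
            i += 2  # both words were consumed by the phrase
        else:
            merged.append(words[i])
            i += 1
    return merged


# Hand-written phrase set (an assumption; a real model learns these)
known_phrases = {("alteryx", "server"), ("formula", "tool")}
sample_words = ["the", "formula", "tool", "runs", "on", "alteryx", "server"]

sample_output = " ".join(merge_bigrams(sample_words, known_phrases))
print(sample_output)  # the formula_tool runs on alteryx_server
```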
And then we can write them out, returning to the generic SDK code. The rest of the SDK code should be good to go!
self.parent.output.set_from_string(self.record_creator, output)
out_record = self.record_creator.finalize_record()
That's pretty much it! This tool will read in the model included in the file directory using the os library. When you package this tool to a .yxi file, it will include the model and read the model in the same way on whatever machine the tool is installed on.
Here is the tool in action:
Before Text Processing:
After Text Processing:
We can tokenize the after text using the spaces and a Text to Columns tool, and then do a simple word count with a Summarize tool.
Filtering out the stop words (which you can do with an actual Filter tool, a Join tool with a Stop Words input, or by adding the functionality to your SDK tool 🙂 ), we see some of the top words and phrases in our Designer KB.
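If you would rather do the counting in Python than with Designer tools, a rough equivalent of the split, filter, and count steps might look like the sketch below. The stop word list and sample texts are placeholders, and top_tokens is a hypothetical helper, not something from the tool.

```python
from collections import Counter

# Placeholder stop word list (an assumption; swap in your own)
STOP_WORDS = {"the", "a", "is", "to", "and"}


def top_tokens(texts, stop_words=STOP_WORDS, n=10):
    """Count space-separated tokens across texts, skipping stop words."""
    counts = Counter()
    for text in texts:
        counts.update(tok for tok in text.split() if tok not in stop_words)
    return counts.most_common(n)


processed = [
    "the formula_tool is a tool",
    "the formula_tool and the alteryx_server",
]
print(top_tokens(processed, n=1))  # [('formula_tool', 2)]
```

Because the phrases were joined with underscores earlier, each one is counted as a single token here.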
As I mentioned at the beginning, this is a really simple example of embedding a model in a Python SDK tool, but the same idea could be leveraged for many different applications. I am so excited to see what you all come up with!
If you would like to have a copy of the tool for your very own (to use or take apart, whatever works), you can download it here.
A geographer by training and a data geek at heart, Sydney joined the Alteryx team as a Customer Support Engineer in 2017. She strongly believes that data and knowledge are most valuable when they can be clearly communicated and understood. She currently manages a team of data scientists that bring new innovations to the Alteryx Platform.