Data Science Blog

Machine learning & data science for beginners and experts alike.
Sr. Community Content Manager

I recently bought Applied Text Analysis with Python - it's not finished yet, but O'Reilly emails me an updated PDF every time a new chapter is written.

It's good so far! I've actually only made it through the first chapter because I couldn't wait to see if I could use some of the code to build an Alteryx tool with the Python SDK.

In this post, I explain how I created a Gender Classification tool with the Python SDK based on example code in chapter 1 - Language and Computation - of Applied Text Analysis with Python. If you're not interested in the Python SDK, you may want to skip ahead to the Additional Resources section where I share the gender classification tool I built, and especially to the Analysis section where I surprised myself by using the tool to perform some genuinely interesting analysis on New York Times text.

 

Chapter 1: Language and Computation

 

The first snippets of code shared in the book identify pieces of text as either male or female (or both or unknown) based on the work of Neal Caren. I started off by following along with the example code, running it in my own Python notebook. The code leverages NLTK - the super popular open source Python library for natural language processing (NLP). Once I was satisfied that my results from analyzing a New York Times article were close enough to those provided in the text, I turned my attention to Alteryx Designer.

 

# results in book
50.288% female (37 sentences)
42.016% unknown (49 sentences)
4.403% both (2 sentences)
3.292% male (3 sentences)
# my results
39.546% unknown (40 sentences)
51.785% female (34 sentences)
4.961% both (2 sentences)
3.709% male (3 sentences)

Not sure why my results are different - perhaps we're working off of different versions of the article - but close enough.
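
For readers who don't have the book handy, here is a minimal sketch of the word-list approach, not the book's actual code. The word lists are short illustrative ones, the tokenize function is a regex stand-in for NLTK's tokenizers, and the name score_gender is made up for this example. It also assumes the percentage is each label's share of words rather than of sentences, which would be consistent with the sentence counts above not tracking the percentages.

```python
import re
from collections import Counter

# Short illustrative word lists (assumption: the book's lists are longer).
MALE_WORDS = {"he", "him", "his", "himself", "man", "men", "father", "son", "mr"}
FEMALE_WORDS = {"she", "her", "hers", "herself", "woman", "women",
                "mother", "daughter", "mrs", "ms"}


def tokenize(text):
    """Regex stand-in for NLTK: one list of lowercase words per sentence."""
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    return [re.findall(r"[a-z']+", s.lower()) for s in sentences if s]


def genderize(words):
    """Label one tokenized sentence as male, female, both, or unknown."""
    male = bool(MALE_WORDS.intersection(words))
    female = bool(FEMALE_WORDS.intersection(words))
    if male and female:
        return "both"
    if male:
        return "male"
    if female:
        return "female"
    return "unknown"


def score_gender(text):
    """Return {label: (percent of words, number of sentences)} for the text."""
    sentence_counts = Counter()
    word_counts = Counter()
    for words in tokenize(text):
        label = genderize(words)
        sentence_counts[label] += 1
        word_counts[label] += len(words)
    total = sum(word_counts.values())
    if total == 0:
        return {}  # no words at all, nothing to score
    return {label: (100.0 * word_counts[label] / total, sentence_counts[label])
            for label in sentence_counts}


sample = "She wrote the book. He read it. They met her and him. It rained."
for label, (percent, sentences) in score_gender(sample).items():
    print("{:0.3f}% {} ({} sentences)".format(percent, label, sentences))
```

The regex sentence splitter will trip over abbreviations like "Mr." - one reason to prefer NLTK's tokenizers for real text.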

 

Getting Started with the SDK

 

The easiest way I know to start making a new Alteryx Python tool is to copy one of the provided example tools and modify the code from there. I want this tool to accept a single input (the text to analyze) and generate a single output (the gender scores), so we start with the Python - Single Input Output tool. Downloading the YXI file and opening it with Alteryx Designer installs the tool to the C:\ProgramData\Alteryx\Tools folder (when you select the Install for all users option) - the tool appears in the Laboratory tool category in Designer.

Next, we'll duplicate the Python - Single Input Output subfolder within the Tools folder and rename it to what we want our new tool to be called - I've called mine ATAwP. We'll then need to modify the config.xml file and the names of the files to match. I also replaced the icon.

 

<EngineSettings EngineDll="Python" EngineDllEntryPoint="Engine.py" SDKVersion="10.1" />
<GuiSettings Html="Gui.html" Icon="icon.png" Help="" SDKVersion="10.1">

These are the two pertinent lines of the config.xml file that need changing. Other lines further down can be updated to change tool metadata like the tool name, tool category, and description.

 

[Figure: Changed file names.]

[Figure: The new tool icon.]

At this point, we have a new tool with a new name and a new icon, but it still does the exact same thing as the Python - Single Input Output example tool.

 

GUI

 

Here is what the example tool's interface looks like:
[Figure: original interface]

Pretty simple. But the only thing we really need for our tool is a dropdown to select a field to perform the gender classification on. So we'll go into the Gui.html file and remove all the stuff we don't need. We're left with a minuscule amount of code...

<label>XMSG("Select a field to analyze")</label>
<ayx data-ui-props = "{type: 'DropDown'}" data-item-props =
"{
dataName: 'FieldSelect',
dataType: 'FieldSelector',
anchorIndex:'0',
connectionIndex:'0'
}"
>
</ayx>

...that produces the following interface:

[Figure: new interface]

The takeaway here as it relates to the Python SDK is that when the user selects a field to analyze from the datastream going into the ATAwP tool, the name of the selected field is stored in an XML element named FieldSelect (named according to the dataName in the Gui.html snippet above). The XML is then passed to the Python script.

[Figure: The tool's user configuration is stored in XML and available to the Python script.]
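
To make that concrete, here is a small sketch of how a Python script could pull the selected field name out of that XML using only the standard library. The helper name selected_field and the exact shape of the example XML (a Configuration root with a FieldSelect child) are assumptions for illustration; check the XML your own tool receives.

```python
import xml.etree.ElementTree as ET


def selected_field(config_xml):
    """Return the field name stored in the FieldSelect element, or None if absent."""
    root = ET.fromstring(config_xml)
    node = root.find("FieldSelect")  # direct child of the root element
    if node is None or not node.text:
        return None
    return node.text


# Hypothetical configuration, shaped like the one described above.
example_xml = "<Configuration><FieldSelect>abstract</FieldSelect></Configuration>"
print(selected_field(example_xml))
```

Returning None when the element is missing or empty lets the caller decide how to report "no field selected yet" instead of crashing.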

Virtual Environment

 

We know that our tool's Python script is going to rely on NLTK. We also know that we're going to want to share this tool with other people. We could manually add NLTK to the Python distribution included with Alteryx and the tool would work on our machine, but it would then fail on any machine that didn't go through the same manual NLTK installation process. To solve this issue, the Python SDK has recently been enhanced with the ability to leverage Python virtual environments. The documentation turned out to be quite easy to follow - it was a quick two-step process. First, create the virtual environment:

C:\Program Files\Alteryx\bin\Miniconda3>python -m venv C:\ProgramData\Alteryx\Tools\ATAwP

Then, install NLTK:

C:\ProgramData\Alteryx\Tools\ATAwP\Scripts>pip install nltk

[Figure: See - easy!]
Now NLTK will be available to our tool's Python script (and after some packaging later on, available to the tool when installed on other people's machines) and we can move on to...

 

Python!

 

The first step is to add the working code from the notebook to the beginning of the Engine.py script. This essentially becomes lines 8-73.

Next, I created a bunch of variables to keep track of the new outgoing field names, types, and contents (as well as the incoming field contents).

I then removed code related to the sorting functionality of the copied tool that we no longer need from:

  • parts of the pi_init method
  • parts of the pi_add_incoming_connection method
  • the entire build_sort_info function
  • a couple of other places

The meat of the code changes, as far as interacting with the Alteryx engine goes, is in two methods: ii_init, where we inform the engine of the field metadata that will be coming out of the tool, and ii_push_record, where we actually call the parse_gender function that we got from the book (with the incoming data as the argument) to populate the outgoing data.
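
The division of labor between those two methods can be sketched in plain Python. This is deliberately not the Alteryx SDK API: the class, its method names, the output field names, and toy_classify are all hypothetical stand-ins, meant only to show "declare the output fields once, then score each record as it arrives".

```python
class GenderRecordHandler:
    """Plain-Python stand-in for the tool's incoming-record logic (not the SDK API)."""

    # Hypothetical output field names; the real tool may use different ones.
    OUTPUT_FIELDS = ("male", "female", "both", "unknown")

    def __init__(self, field_name, classify):
        self.field_name = field_name  # field chosen in the tool's interface
        self.classify = classify      # function: text -> {label: percent}
        self.output_schema = None
        self.rows = []

    def init(self, incoming_fields):
        """Analogous to ii_init: runs once, declares the outgoing field metadata."""
        if self.field_name not in incoming_fields:
            return False
        self.output_schema = list(incoming_fields) + list(self.OUTPUT_FIELDS)
        return True

    def push_record(self, record):
        """Analogous to ii_push_record: runs per record, fills the new fields."""
        scores = self.classify(record[self.field_name] or "")
        out = dict(record)
        for name in self.OUTPUT_FIELDS:
            out[name] = scores.get(name, 0.0)
        self.rows.append(out)
        return True


def toy_classify(text):
    """Toy scorer standing in for the book's parse_gender function."""
    return {"female": 100.0} if "she" in text.lower().split() else {"unknown": 100.0}


handler = GenderRecordHandler("abstract", toy_classify)
if handler.init(["title", "abstract"]):
    handler.push_record({"title": "A story", "abstract": "She won the race"})
print(handler.rows)
```

In the real tool the engine calls these methods for you; the point of the sketch is only that the metadata is set up before any record flows through.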

That's pretty much all there is to it! You can look at the detailed changes from the Python - Single Input Output tool to the ATAwP tool here. This view highlights additions in green, highlights subtractions in red, and collapses most of the parts that are unchanged.

 

Packaging the Tool for Distribution

 

Now that the tool is done and working on my machine, time to package it up so we can share it with others! Step one is to create the requirements.txt file. When someone installs the tool, this file tells Alteryx (and Python) what libraries need to be installed.

 

C:\ProgramData\Alteryx\Tools\ATAwP\Scripts>pip freeze > ..\requirements.txt

In our case the contents of the file look like this (nltk depends on six):

nltk==3.2.5
six==1.11.0

Now we copy over this new requirements.txt file, along with the core files in the root tool folder (not the files that were automatically generated during the creation of the virtual environment), into a new folder. Then follow the instructions here for creating a YXI file. In the end, we have this folder structure:

  • ATAwP.yxi (a zip archive renamed from .zip)
    • Config.xml
    • ATAwP
      • ATAwPConfig.xml
      • Engine.py
      • Gui.html
      • icon.png
      • requirements.txt
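
If you'd rather script the zip-and-rename step than do it by hand, a sketch along these lines should work, since the YXI is a zip archive under a different extension. The function name build_yxi and the example paths are made up for illustration, and the sketch assumes the source folder already has the layout shown above (Config.xml at the root, the tool files in the ATAwP subfolder).

```python
import zipfile
from pathlib import Path


def build_yxi(source_dir, yxi_path):
    """Zip everything under source_dir into yxi_path, keeping relative paths.

    Write yxi_path somewhere outside source_dir so the archive does not
    try to include itself.
    """
    source = Path(source_dir)
    with zipfile.ZipFile(yxi_path, "w", zipfile.ZIP_DEFLATED) as archive:
        for path in sorted(source.rglob("*")):
            if path.is_file():
                # Store forward-slash paths relative to the package root.
                archive.write(path, path.relative_to(source).as_posix())
    return Path(yxi_path)
```

For example, build_yxi("package", "ATAwP.yxi") would produce the archive from a hypothetical package folder in the current directory.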

 

[Animation: zip to yxi]

 

Share the YXI file. When someone double clicks on it, it will get automatically installed to their Alteryx Designer toolbar, and the dependent Python libraries (like NLTK) will get installed as well!

 

Analysis

 

I recently came across Taylor Cox's (@Coxta45) gorgeous New York Times connector, so when it came time to test my new tool, I knew just how to collect the data. The tool uses the Times' Top Stories API, which, when I ran it on March 20, pulled 729 stories, mostly from the previous week. I used the ATAwP tool to gender classify the abstract returned by the API.

 

[Figure: analysis workflow]

 

The analysis showed that the obituaries section was the most male-dominated section in the paper over this period of time. After manually counting, I confirmed that 24 of 29 obituaries in this time frame (2018-03-13 to 2018-03-20) were for men. In a pure coincidence, this reminded me that on March 8th the New York Times noted that women have been historically underrepresented in their obituaries. While I hadn't set out to analyze the obituaries, the stark nature of the results led me down that path, and it turns out that the Times has performed their own comprehensive analysis.

 

I've attached the analysis workflow to this post. You'll need to download and install the ATAwP tool (see next section) before opening the workflow.

 

Additional Resources

 

  • Download the ATAwP tool that this post discusses here, then install it by opening the file in Designer (requires version 2018.1.4+).
  • The ATAwP GitHub repo
  • The developer community: developers.alteryx.com. Here you're linked to the documentation and can discuss the tool building process with your peers.
  • Taylor's GitHub page
  • Install New York Times connector from Gallery
  • And here's a blog post I came across recently (and here's an English translation).

Next Steps

 

Obviously, we've only scratched the surface (I've only made it through one chapter). So far we've used NLTK, but as you can see in the table below, there are several other open source packages we can use for text analysis.

 

[Figure: Table 1-1. NLP Tools in Python, from Applied Text Analysis with Python]

Final Thoughts

 

It took me a few hours to go through the first chapter of the book and get the code working in my notebook, and a few more hours to create the tool with the Python SDK. But then, using the NYTimes Connector and the new gender analysis tool, it took only minutes to generate an insightful analysis. And I think that's a very fitting way to describe the benefit of extending Alteryx Designer by creating new tools: a little bit of extra upfront effort can pay dividends down the line for your organization (or the entire Alteryx Community when you share tools like Taylor does) in terms of increased productivity.

Lastly - have you heard about Alteryx BUILD? Feel free to use this tool as a starting point for your project!

Comments

 @NeilR This is awesome, thanks for sharing - learned a lot from it!

Thank you Neil and Nick for your blog posts. I have got started with the Python SDK!

Wine, Whiskey and creativity is in my future.

Cheers guys. S*


And I am already reading....