I recently bought Applied Text Analysis with Python - it's not finished yet, but O'Reilly emails me an updated PDF every time a new chapter is written.
It's good so far! I've actually only made it through the first chapter because I couldn't wait to see if I could use some of the code to build an Alteryx tool with the Python SDK.
In this post, I explain how I created a Gender Classification tool with the Python SDK based on example code in chapter 1 - Language and Computation - of Applied Text Analysis with Python. If you're not interested in the Python SDK, you may want to skip ahead to the Additional Resources section where I share the gender classification tool I built, and especially to the Analysis section where I surprised myself by using the tool to perform some genuinely interesting analysis on New York Times text.
The first snippets of code shared in the book identify pieces of text as either male or female (or both or unknown) based on the work of Neal Caren. I started off by following along with the example code by running it in my own Python notebook. The code leverages NLTK - the super popular open source Python library for natural language processing (NLP). Once I was happy enough that my results analyzing a New York Times article were close enough to those provided in the text, I turned my attention to Alteryx Designer.
# results in book
50.288% female (37 sentences)
42.016% unknown (49 sentences)
4.403% both (2 sentences)
3.292% male (3 sentences)
# my results
39.546% unknown (40 sentences)
51.785% female (34 sentences)
4.961% both (2 sentences)
3.709% male (3 sentences)
Not sure why my results are different - perhaps we're working off of different versions of the article - but close enough.
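To make the idea concrete, here is a minimal sketch of this style of classification: label each sentence by the gendered words it contains, then report each label's share of the total word count alongside its sentence count, which is the shape of the output above. This is not the book's code. The word lists are tiny illustrative samples, a regular expression stands in for NLTK's tokenizers, and the genderize helper name is an assumption made for this example.

```python
import re
from collections import Counter

# Illustrative word lists only (assumption); a real classifier needs far longer ones.
MALE_WORDS = {"he", "him", "his", "man", "men", "father", "son", "mr"}
FEMALE_WORDS = {"she", "her", "hers", "woman", "women", "mother", "daughter", "ms", "mrs"}


def genderize(words):
    """Label a list of lowercase words as male, female, both, or unknown."""
    has_male = any(w in MALE_WORDS for w in words)
    has_female = any(w in FEMALE_WORDS for w in words)
    if has_male and has_female:
        return "both"
    if has_male:
        return "male"
    if has_female:
        return "female"
    return "unknown"


def parse_gender(text):
    """Return {label: (percent_of_words, sentence_count)} for the text."""
    # Naive sentence and word splitting; NLTK's tokenizers are more robust.
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    word_counts, sentence_counts = Counter(), Counter()
    for sentence in sentences:
        words = re.findall(r"[a-z']+", sentence.lower())
        label = genderize(words)
        word_counts[label] += len(words)
        sentence_counts[label] += 1
    total = sum(word_counts.values()) or 1  # avoid dividing by zero on empty text
    return {
        label: (100.0 * word_counts[label] / total, sentence_counts[label])
        for label in word_counts
    }


if __name__ == "__main__":
    sample = "She wrote the report. He read it to her. The weather was fine."
    for label, (pct, n) in parse_gender(sample).items():
        print(f"{pct:.3f}% {label} ({n} sentences)")
```

Because the percentages are weighted by words rather than sentences, a label with fewer sentences can still account for a larger share of the text, as in the results above.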
The easiest way I know to start making a new Alteryx Python tool is to copy one of the provided example tools and modify the code from there. I want this tool to accept a single input (the text to analyze) and generate a single output (the gender scores), so we start with the Python - Single Input Output tool. Downloading the YXI file and opening it with Alteryx Designer installs the tool to the C:\ProgramData\Alteryx\Tools folder (when you select the Install for all users option). The tool appears in the Laboratory tool category in Designer.
Next, we'll duplicate the Python - Single Input Output subfolder within the Tools folder and rename it to what we want our new tool to be called - I've called mine ATAwP. We'll then need to modify the config.xml file and the names of the files to match. I also replaced the icon.
<EngineSettings EngineDll="Python" EngineDllEntryPoint="Engine.py" SDKVersion="10.1" />
<GuiSettings Html="Gui.html" Icon="icon.png" Help="" SDKVersion="10.1">
These are the two pertinent lines of the config.xml file that need changing. Other lines further down can be updated to change tool metadata like the tool name, tool category, and description.
Changed file names.
The new tool icon.
At this point, we have a new tool with a new name and a new icon, but it still does the exact same thing as the Python - Single Input Output example tool.
Here is what the example tool's interface looks like:
Pretty simple. But the only thing we really need for our tool is a dropdown to select a field to perform the gender classification on. So we'll go into the Gui.html file and remove all the stuff we don't need. We're left with a minuscule amount of code...
<label>XMSG("Select a field to analyze")</label>
<ayx data-ui-props = "{type: 'DropDown'}" data-item-props =
"{
dataName: 'FieldSelect',
dataType: 'FieldSelector',
anchorIndex:'0',
connectionIndex:'0'
}"
>
</ayx>
...that produces the following interface:
The takeaway here as it relates to the Python SDK is that when the user selects a field to analyze from the datastream going into the ATAwP tool, the name of the selected field is stored in an XML element named FieldSelect (named according to the dataName in the Gui.html snippet above). The XML is then passed to the Python script.
The tool's user configuration is stored in XML and available to the Python script.
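As a rough illustration of that hand-off, the sketch below pulls the selected field name out of a configuration string with Python's standard xml.etree.ElementTree module. The XML shown is a hypothetical stand-in (the Configuration wrapper and the abstract value are assumptions), and get_selected_field is a helper name invented for this example, not part of the SDK.

```python
import xml.etree.ElementTree as ET

# Hypothetical configuration XML (assumption): the real document produced by
# Designer may contain additional elements and attributes.
config_xml = "<Configuration><FieldSelect>abstract</FieldSelect></Configuration>"


def get_selected_field(str_xml):
    """Return the text of the FieldSelect element, or None if it is absent."""
    element = ET.fromstring(str_xml).find("FieldSelect")
    return element.text if element is not None else None


selected_field = get_selected_field(config_xml)
print(selected_field)
```

Returning None when the element is missing lets the script detect that the user has not yet picked a field, rather than failing with an AttributeError.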
We know that our tool's Python script is going to rely on NLTK. We also know that we're going to want to share this tool with other people. We could manually add NLTK to the Python distribution included with Alteryx and the tool would work on our machine, but it would then fail on any machine that didn't go through the same manual NLTK installation process. To solve this issue, the Python SDK has recently been enhanced with the ability to leverage Python virtual environments. The documentation turned out to be quite easy to follow - it was a quick two-step process. First, create the virtual environment:
C:\Program Files\Alteryx\bin\Miniconda3>python -m venv C:\ProgramData\Alteryx\Tools\ATAwP
Then, install NLTK:
C:\ProgramData\Alteryx\Tools\ATAwP\Scripts>pip install nltk
See - easy!
Now NLTK will be available to our tool's Python script (and after some packaging later on, available to the tool when installed on other people's machines) and we can move on to...
The first step is to add the working code from the notebook to the beginning of the Engine.py script. This essentially becomes lines 8-73.
Next, I created a bunch of variables to keep track of the new outgoing field names, types, and contents (as well as the incoming field contents).
I then removed code related to the sorting functionality of the copied tool that we no longer need from the pi_init method, the pi_add_incoming_connection method, and the build_sort_info function.
The meat of the code changes as they relate to interacting with the Alteryx engine occurs in ii_init, where we inform the engine of the field metadata that will be coming out of the tool, and in ii_push_record, where we actually call the parse_gender function (with the incoming data as the argument) that we got from the book to populate the outgoing data.
That's pretty much all there is to it! You can look at the detailed changes from the Python - Single Input Output tool to the ATAwP tool here. This view highlights additions in green, highlights subtractions in red, and collapses most of the parts that are unchanged.
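The division of labor between ii_init and ii_push_record can be sketched without the SDK at all. The class below is a plain-Python stand-in, not the Alteryx API: it only mimics the order of calls (declare the outgoing fields once, then score each record as it arrives). The class name, the pct_ field names, and the placeholder parse_gender are all assumptions made for illustration.

```python
# Stand-in code (assumption): this is NOT the Alteryx SDK. It only mimics the
# order in which the engine calls ii_init and ii_push_record.

LABELS = ["female", "male", "both", "unknown"]


def parse_gender(text):
    """Placeholder classifier returning a percent per label (hypothetical)."""
    words = text.lower().split()
    label = "female" if "she" in words else "male" if "he" in words else "unknown"
    return {name: (100.0 if name == label else 0.0) for name in LABELS}


class IncomingInterfaceSketch:
    def __init__(self, field_name):
        self.field_name = field_name  # the field chosen in the tool's dropdown
        self.output_fields = []
        self.output_rows = []

    def ii_init(self, incoming_fields):
        """Declare outgoing metadata: incoming fields plus one score per label."""
        if self.field_name not in incoming_fields:
            return False
        self.output_fields = list(incoming_fields) + [f"pct_{n}" for n in LABELS]
        return True

    def ii_push_record(self, record):
        """Score one incoming record and store the enriched row."""
        scores = parse_gender(record[self.field_name] or "")
        row = dict(record)
        row.update({f"pct_{n}": scores[n] for n in LABELS})
        self.output_rows.append(row)
        return True


tool = IncomingInterfaceSketch("abstract")
tool.ii_init(["title", "abstract"])
tool.ii_push_record({"title": "t1", "abstract": "She won the race"})
print(tool.output_rows[0])
```

In the real tool the engine supplies record objects and expects metadata and records to be pushed to an output anchor, so the dictionaries here are only a convenient substitute.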
Now that the tool is done and working on my machine, time to package it up so we can share it with others! Step one is to create the requirements.txt file. When someone installs the tool, this file tells Alteryx (and Python) what libraries need to be installed.
C:\ProgramData\Alteryx\Tools\ATAwP\Scripts>pip freeze > ..\requirements.txt
In our case the contents of the file look like this (nltk depends on six):
nltk==3.2.5
six==1.11.0
Now we copy over this new requirements.txt file, along with the core files in the root tool folder (not the files that were automatically generated during the creation of the virtual environment), into a new folder. Then follow the instructions here for creating a YXI file. In the end, we have this folder structure:
Share the YXI file. When someone double clicks on it, it will get automatically installed to their Alteryx Designer toolbar, and the dependent Python libraries (like NLTK) will get installed as well!
I recently came across Taylor Cox's (@Coxta45) gorgeous New York Times connector, so when it came time to test my new tool, I knew just how to collect the data. The tool uses the Times' Top Stories API, which when I ran it on March 20 pulled 729 stories, mostly from the previous week. I used the ATAwP tool to gender classify the abstract returned by the API.
The analysis showed that the obituaries section was the most male-dominated section in the paper over this period of time. After manually counting, I confirmed that 24 of 29 obituaries in this time frame (2018-03-13 to 2018-03-20) were for men. In a pure coincidence, this reminded me that on March 8th the New York Times noted that women have been historically underrepresented in their obituaries. While I hadn't set out to analyze the obituaries, the stark nature of the results led me down that path, and it turns out that the Times has performed their own comprehensive analysis.
I've attached the analysis workflow to this post. You'll need to download and install the ATAwP tool (see next section) before opening the workflow.
Obviously, we've only scratched the surface (I've only made it through one chapter). So far we've used NLTK, but as you can see in the table below, there are several other open source packages we can use for text analysis.
Table 1-1. NLP Tools in Python from Applied Text Analysis with Python
It took me a few hours to go through the first chapter of the book and get the code working in my notebook, and a few more hours to create the tool with the Python SDK. But then, using the NYTimes Connector and the new gender analysis tool, it took only minutes to generate an insightful analysis. And I think that's a very fitting way to describe the benefit of extending Alteryx Designer by creating new tools: a little bit of extra upfront effort can pay dividends down the line for your organization (or the entire Alteryx Community when you share tools like Taylor does) in terms of increased productivity.
Lastly - have you heard about Alteryx BUILD? Feel free to use this tool as a starting point for your project!
Neil Ryan (he/him) is the Sr Manager, Community Content, responsible for the content in the Alteryx Community. He held previous roles at Alteryx including Advanced Analytics Product Manager and Content Engineer, and had prior gigs doing fraud detection analytics consulting and creating actuarial pricing models. Neil's industry experience and technical skills are wide ranging and well suited to drive compelling content tailored for Community members to rank up in their careers.