Data Science

NickJ · 05-09-2018

As an Alteryx associate who’s grown with the company from the technical pre-sales side, you’d expect that I’d be pretty handy at building a workflow or two. I still remember building my first macro and getting a little buzz when I realised that I could re-use and share this with others, and it just worked.

Fast forward a few months, and another buzz when I discovered the Gallery API – build an analytic app in Alteryx Designer and – hey presto! – an instant analytic service that could be called like any web service. We had some real fun playing with these API end-points – putting them inside smart devices like the Bttn and Amazon’s Alexa!

Jump to 2017, and as Alteryx acquired the technology and skills of YHat we developed Alteryx Promote – giving me access to a near-realtime API that I could use to serve up predictive model responses: everything from fast scoring, to a next-best action, to text analysis – all through a simple-to-implement REST/JSON format.

But there’s always been one area of the product that I’ve shied away from: the Alteryx Engine and its Software Development Kit (SDK). Traditionally, this has been the domain of the best-and-brightest in Alteryx and beyond (James Dunkerley – I’m looking at you here!) and required a certain level of mastery in C++ to be able to talk the Alteryx language.

I’ve been watching the development of the Python SDK with great interest because, as a self-confessed amateur coder, Python, R and SQL are probably the limits of my coding aspirations. (I’m currently learning ‘Just enough Javascript to be Dangerous’ – because JavaScript/Node seems to be everywhere these days, and who doesn’t want to be a little dangerous….?)

Knowing just enough Python to get started, the Python SDK is my gateway to this final frontier – talking efficiently and directly to the Alteryx Engine and building new high-performance, sharable tools that can use so many great open-source Python libraries as a genuine complement to Alteryx’s existing R-based tools.

So, I jumped into the Python SDK documentation with great excitement… and almost immediately got stuck. How disappointing! I found the engine terminology confusing and I couldn’t progress beyond the ‘hello world’ basics of the initial samples.

Then, just a week ago Neil Ryan released a really powerful guide as part of the Community’s Data Science Blog – this reignited my desire to crack the Python SDK and so (with Neil’s code and personal expertise) I’ve now developed my first simple Python-based SDK tool, and I’d like to share this with you in this blog. I’m going to cover all the steps I took so that hopefully you can replicate or enhance the code, or just take it in any direction you wish!

What Shall We Build Today?

A new project requires a challenge! I really liked Neil’s example of Text Analytics in the blog post, and I want to take this both one step further (in terms of content) and one step backward (in terms of simplicity)!

I’d like to use a Python module called ‘newspaper3k’ to run article summarisation on a supplied URL – that is, I give you a URL and you analyse the text behind the link, and return to me the most important five sentences within the article. I love article summarisers – I can get the gist of a page without having to read the entire document, and I would love to have an article summary tool in Alteryx so that I can automate this process!

First Steps

A great starting place is to download the SDK samples from https://github.com/alteryx/python-sdk-samples - focus initially on the Python – Single Input Output example: this contains everything we need for our first basic tool.

Copy this directory to your local machine, and create a folder structure as follows:

Rename your copied folder with the name of your tool (in my case, Article)
Within the copied folder, rename Python - Single Input OutputConfig.xml to the name of your tool, with ‘Config’ at the end, such as ArticleConfig.xml
Rename Python - Single Input OutputGui.html to Article_GUI.html
Remove the Python - Single Input OutputIcon.png file (we’ll be getting our own in a moment)
Remove the Python - Single Input OutputEngine.py python file – we’ll be building up a new, simpler Python file in the remainder of this blog.
Remove the language-specific config files (unless you have a desire to keep/alter them) - Python - Single Input OutputConfig.fr.yxlang, Python - Single Input OutputConfig.de.yxlang, Python - Single Input OutputConfig.xx.yxlang

Go and grab an image for your new tool. There are plenty of sites that offer free icon sets (personally, I use http://iconapp.io/ and https://iconmonstr.com/ ) – save your chosen icon in png format into the Article folder with the name Article_Icon.png.

Next, create a brand-new empty file in an editor of your choice (since we’re writing Python code, you might want to pick an editor that handles Python code formatting automatically – Python is especially picky about indentation) and save this as Article_Engine.py.

That’s the first part of the process complete – we’re ready to start customising our tool!

Configuration before Coding

Let’s jump into our ArticleConfig.xml file – this tells Alteryx the purpose of all the files we’ve just copied or created.

We’ll be making changes in two places: simply change the file names to the ones we created in the previous section, and update the MetaInfo to contain a good description of the tool we’re building!

For those of you wanting a cut-and-paste, here’s the XML code below:

<?xml version="1.0"?>
<AlteryxJavaScriptPlugin>
  <EngineSettings EngineDll="Python" EngineDllEntryPoint="Article_Engine.py" SDKVersion="10.1" />
  <GuiSettings Html="Article_GUI.html" Icon="Article_Icon.png" Help="https://help.alteryx.com/developer/current/index.htm#Python/Examples.htm" SDKVersion="10.1">
    <InputConnections>
      <Connection Name="Input" AllowMultiple="False" Optional="False" Type="Connection" Label=""/>
    </InputConnections>
    <OutputConnections>
      <Connection Name="Output" AllowMultiple="False" Optional="False" Type="Connection" Label=""/>
    </OutputConnections>
  </GuiSettings>
  <Properties>
    <MetaInfo>
      <Name>Python - Article Summary (Newspaper3k)</Name>
      <Description>Returns the most relevant sentences from a supplied URL.</Description>
      <ToolVersion>1.1</ToolVersion>
      <CategoryName>Laboratory</CategoryName>
      <SearchTags>python, sdk, text analytics, text, nlp, python sdk</SearchTags>
      <Author>Nick Jewell</Author>
      <Company>Alteryx, Inc.</Company>
      <Copyright>2018</Copyright>
    </MetaInfo>
  </Properties>
</AlteryxJavaScriptPlugin>
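
A mistyped file name in this config is an easy way to end up with a tool that won't load, so it can be worth checking which files the XML actually points at. Here's a minimal sketch using only Python's standard library. The function name referenced_files is my own invention for illustration and isn't part of any Alteryx SDK, and the sample string is a cut-down copy of the config above.

```python
import xml.etree.ElementTree as Et


def referenced_files(config_xml: str) -> dict:
    """Return the file names that a tool's Config.xml points at."""
    root = Et.fromstring(config_xml)
    engine = root.find("EngineSettings")
    gui = root.find("GuiSettings")
    return {
        "engine": engine.get("EngineDllEntryPoint"),
        "html": gui.get("Html"),
        "icon": gui.get("Icon"),
    }


# A cut-down copy of ArticleConfig.xml, just enough to demonstrate the lookup.
SAMPLE_CONFIG = """<?xml version="1.0"?>
<AlteryxJavaScriptPlugin>
  <EngineSettings EngineDll="Python" EngineDllEntryPoint="Article_Engine.py" SDKVersion="10.1" />
  <GuiSettings Html="Article_GUI.html" Icon="Article_Icon.png" SDKVersion="10.1" />
</AlteryxJavaScriptPlugin>"""

print(referenced_files(SAMPLE_CONFIG))
```

Each of the three names returned should match a file sitting in the Article folder.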

Designing the Interface

Some tools will have more complex user interfaces than others. This blog covers a pretty basic interface where the user selects a single field for text analysis, so our GUI file is going to be really simple. (Which is good for us as we’re learning!)

Open the Article_GUI.html file and reduce the code down to the following:

<!DOCTYPE html>
<html style="padding:20px">
<head>
  <meta charset="utf-8">
  <title>Article Summary</title>
  <script type="text/javascript">
    document.write('<link rel="import" href="' + window.Alteryx.LibDir + '2/lib/includes.html">');
  </script>
</head>
<body>
  <label>XMSG("Select a field containing a URL to analyze")</label>
    <ayx data-ui-props = "{type: 'DropDown'}" data-item-props =
      "{
        dataName: 'FieldSelect',
        dataType: 'FieldSelector',
        anchorIndex:'0',
        connectionIndex:'0'
      }"
    >
    </ayx>
  <script type="text/javascript">                     
    Alteryx.Gui.BeforeLoad = (manager, AlteryxDataItems, json) => {
    }
    Alteryx.Gui.AfterLoad = (manager) => {
    }
  </script>
 
</body>
</html>

All we need to care about in this code is that it uses the Alteryx JavaScript SDK to create a drop-down that inherits the field names from the data-stream and lets the user choose one of these fields. See below for an action screenshot:

Save this HTML file and that’s our configuration complete. We’re now ready to start tackling the python part of our project!

Prototyping Python

I’ve generally found that a browser-based Python environment such as Anaconda’s Jupyter Notebook is the most conducive to rapid iteration and testing of code – your mileage may vary, but choose an environment where you can test out your custom code before inserting it into the Alteryx SDK. This approach may save you many hours of wrangling python errors!

For example, in a Jupyter notebook I’ve sketched the following functionality in just a few lines:

The very first command !pip install newspaper3k ensures that the newspaper3k library is installed into my Python environment (it’s not a standard part of the Anaconda or Alteryx distributions).

I then import the Article functionality from the newspaper module (line 2), supply a URL (line 22) and proceed to download, parse and analyse the text behind the URL according to the module’s documentation (lines 23-26).

Finally, in line 27, I produce a 5-sentence summary of the article, delimited by the newline character (\n). This is the information that I’d like to bring back into Alteryx Designer for further analysis and blending.
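
For anyone who'd rather type than squint, the prototype steps described above can be sketched roughly as follows. This is a reconstruction from the description rather than a copy of my notebook. The function name summarise_url and its article_factory parameter are hypothetical additions (the parameter just lets you swap in a stand-in class when newspaper3k isn't installed), and the URL in the comment is a placeholder.

```python
def summarise_url(url, article_factory=None):
    """Download, parse and summarise the page behind `url`."""
    if article_factory is None:
        # Imported lazily so this file still loads without newspaper3k installed.
        from newspaper import Article
        article_factory = Article

    article = article_factory(url)
    article.download()  # fetch the HTML
    article.parse()     # extract the title, authors and body text
    article.nlp()       # keywords and summary (needs NLTK's 'punkt' data)
    return article.summary


# In a notebook, with newspaper3k installed and a URL of your choice:
# print(summarise_url("https://example.com/some-article"))
```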

Once you’re happy that you have some working code, let’s step into a text editor/python editor and begin to make changes to our Article_Engine.py file – the core of our Python SDK work.

Talking to the Alteryx Engine in Python

In this section, I’ll break down the code section-by-section, explaining where I’m making additions (and why). All of the code in this section lives in the Article_Engine.py file.

"""
AyxPlugin (required) has-a IncomingInterface (optional).
Although defining IncomingInterface is optional, the interface methods are needed if an upstream tool exists.
"""
import AlteryxPythonSDK as Sdk
import xml.etree.ElementTree as Et
import nltk
nltk.download('punkt')
from newspaper import Article

In these opening lines, we’re making sure that we have access to Python’s Natural Language Toolkit (NLTK), its ‘punkt’ sentence tokenizer data (which is what splits text into sentences) and the newspaper module that we tested in the previous section.

class AyxPlugin:
    """
    Implements the plugin interface methods, to be utilized by the Alteryx engine to communicate with a plugin.
    Prefixed with "pi", the Alteryx engine will expect the below five interface methods to be defined.
    """
 
    def __init__(self, n_tool_id: int, alteryx_engine: object, output_anchor_mgr: object):
        """
        Constructor is called whenever the Alteryx engine wants to instantiate an instance of this plugin.
        :param n_tool_id: The assigned unique identification for a tool instance.
        :param alteryx_engine: Provides an interface into the Alteryx engine.
        :param output_anchor_mgr: A helper that wraps the outgoing connections for a plugin.
        """
 
        # Default properties
        self.n_tool_id = n_tool_id
        self.alteryx_engine = alteryx_engine
        self.output_anchor_mgr = output_anchor_mgr
 
        # Custom properties
 
        self.summary = "article_summary"
        self.summary_type = Sdk.FieldType.string
        self.summary_size = 1000

Our custom properties here describe the new field that will carry our output from the tool. In these three lines, we’ve defined an output field called ‘article_summary’ that’s a string and has a maximum size of 1000 characters.

    def pi_init(self, str_xml: str):
        """
        Reads the field the user selected in the GUI and looks up the output anchor.
        Called when the Alteryx engine is ready to provide the tool configuration from the GUI.
        :param str_xml: The raw XML from the GUI.
        """
 
        # Parse the configuration once; field_selection stays None until the user picks a field.
        field_node = Et.fromstring(str_xml).find('FieldSelect')
        self.field_selection = field_node.text if field_node is not None else None
 
        if self.field_selection is None:
            self.alteryx_engine.output_message(self.n_tool_id, Sdk.EngineMessageType.error, 'Please select field to analyze')
        else:
            self.alteryx_engine.output_message(self.n_tool_id, Sdk.EngineMessageType.info, self.field_selection)
 
        self.output_anchor = self.output_anchor_mgr.get_output_anchor('Output')  # Getting the output anchor from the XML file.

In this section, we’re asking the plugin interface (‘pi’) for the field to analyse, and storing the value into the field_selection property for later use.
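
If you'd like to see the lookup in isolation, here's a small standalone sketch. I'm assuming the configuration string arrives wrapped in a Configuration root element with one child per dataName from the GUI; that shape is my assumption rather than something taken from the SDK documentation, and selected_field is just an illustrative name.

```python
import xml.etree.ElementTree as Et


def selected_field(str_xml: str):
    """Return the text of <FieldSelect>, or None if nothing has been chosen yet."""
    node = Et.fromstring(str_xml).find("FieldSelect")
    return node.text if node is not None else None


# Hypothetical configuration strings, shaped like what the GUI is assumed to send.
print(selected_field("<Configuration><FieldSelect>url</FieldSelect></Configuration>"))
print(selected_field("<Configuration />"))
```

The second call returns None, which is the case a freshly dropped tool will hit before the user has picked a field.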

    def pi_add_incoming_connection(self, str_type: str, str_name: str) -> object:
        """
        The IncomingInterface objects are instantiated here, one object per incoming connection.
        Called when the Alteryx engine is attempting to add an incoming data connection.
        :param str_type: The name of the input connection anchor, defined in the Config.xml file.
        :param str_name: The name of the wire, defined by the workflow author.
        :return: The IncomingInterface object(s).
        """
 
        self.single_input = IncomingInterface(self)
        return self.single_input
 
    def pi_add_outgoing_connection(self, str_name: str) -> bool:
        """
        Called when the Alteryx engine is attempting to add an outgoing data connection.
        :param str_name: The name of the output connection anchor, defined in the Config.xml file.
        :return: True signifies that the connection is accepted.
        """
 
        return True

    def pi_push_all_records(self, n_record_limit: int) -> bool:
        """
        Called when a tool has no incoming data connection.
        :param n_record_limit: Set it to <0 for no limit, 0 for no records, and >0 to specify the number of records.
        :return: True for success, False for failure.
        """
 
        # The template's xmsg() translation helper isn't defined in this class, so pass the text directly.
        self.alteryx_engine.output_message(self.n_tool_id, Sdk.EngineMessageType.error, 'Missing Incoming Connection')
        return False
 
    def pi_close(self, b_has_errors: bool):
        """
        Called after all records have been processed.
        :param b_has_errors: Set to true to not do the final processing.
        """
 
        self.output_anchor.assert_close()  # Checks whether connections were properly closed.

This section has been left as per the default GitHub code for incoming/outgoing connections to the plugin, and error-handling/closing the connection to the plugin.

class IncomingInterface:
    """
    This optional class is returned by pi_add_incoming_connection, and it implements the incoming interface methods, to
    be utilized by the Alteryx engine to communicate with a plugin when processing an incoming connection.
    Prefixed with "ii", the Alteryx engine will expect the below four interface methods to be defined.
    """
 
    def __init__(self, parent: object):
        """
        Constructor for IncomingInterface.
        :param parent: AyxPlugin
        """
 
        # Default properties
        self.parent = parent
 
        # Custom properties
        self.record_copier = None
        self.record_creator = None

The incoming interface class handles the Alteryx Engine’s interactions with the plugin, and this is where most of our code will be placed. We have to make changes to the GitHub code in order to specify the fields that get processed on a row-by-row basis. We set up these definitions in the ii_init() function, below:

    def ii_init(self, record_info_in: object) -> bool:
        """
        Called to report changes of the incoming connection's record metadata to the Alteryx engine.
        :param record_info_in: A RecordInfo object for the incoming connection's fields.
        :return: True for success, otherwise False.
        """
 
        # Returns a new RecordInfo object with the same field layout as record_info_in.
        record_info_out = record_info_in.clone()
 
        # Adds field to record with specified name and output type.
        #record_info_out.add_field(self.parent.out_name, self.parent.out_type, self.parent.out_size)
 
        record_info_out.add_field(self.parent.summary, self.parent.summary_type, self.parent.summary_size)
 
        # Lets the downstream tools know what the outgoing record metadata will look like, based on record_info_out.
        self.parent.output_anchor.init(record_info_out)
 
        # Creating a new, empty record creator based on record_info_out's record layout.
        self.record_creator = record_info_out.construct_record_creator()
 
        # Instantiate a new instance of the RecordCopier class.
        self.record_copier = Sdk.RecordCopier(record_info_out, record_info_in)
 
        # Map each column of the input to where we want in the output.
        for index in range(record_info_in.num_fields):
            # Adding a field index mapping.
            self.record_copier.add(index, index)
 
        # Let record copier know that all field mappings have been added.
        self.record_copier.done_adding()
 
        # Grab the index of our new field in the record, so we don't have to do a string lookup on every push_record.
        #self.parent.out_field = record_info_out[record_info_out.get_field_num(self.parent.out_name)]
 
        self.parent.summary = record_info_out[record_info_out.get_field_num(self.parent.summary)]
       
        # Grab the index of our input field in the record, so we don't have to do a string lookup on every push_record.
        self.parent.input_field = record_info_out[record_info_out.get_field_num(self.parent.field_selection)]
 
        return True

With the clone() and add_field() calls, we’re creating a record layout based on a ‘clone’ (copy) of the incoming fields, then adding our new summary field to the metadata at the end of the record. In Alteryx terms, this is like using a Formula tool to create a new field within a dataset.

Towards the end of this code block, we’re making sure that our fields are efficiently stored so that we don’t have to do unnecessary lookups as part of the processing.

    def ii_push_record(self, in_record: object) -> bool:
        """
        Responsible for pushing records out
        Called when an input record is being sent to the plugin.
        :param in_record: The data for the incoming record.
        :return: False if method calling limit (record_cnt) is hit.
        """
        # Copy the data from the incoming record into the outgoing record.
        self.record_creator.reset()
        self.record_copier.copy(self.record_creator, in_record)
                                               
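        url = self.parent.input_field.get_as_string(in_record)
        if url is not None:
            # Download, parse and summarise the article behind the URL.
            article = Article(url)
            article.download()
            article.parse()
            article.nlp()
            self.parent.summary.set_from_string(self.record_creator, article.summary)
        else:
            # No URL on this row: leave the summary empty rather than skipping the record.
            self.parent.summary.set_null(self.record_creator)

        # Finalize outside the if so out_record is always defined before it is pushed.
        out_record = self.record_creator.finalize_record()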
 
        # Push the record downstream and quit if there's a downstream error.
        if not self.parent.output_anchor.push_record(out_record):
            return False
 
        return True

The ii_push_record() function is where the majority of our custom coding is placed. Our text analysis code is located inside an if statement that checks that the incoming URL value isn’t null. We then execute the article summarisation and place the result back into the summary field that we created at the start of the code.

We call finalize_record() to build the completed output record, and push_record() then sends it downstream to the next tool in Alteryx Designer.
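
One thing the ii_push_record() code doesn't cover is a URL that can't be fetched: in my experience a failure somewhere in the download/parse/nlp sequence surfaces as a Python exception, and a single bad row would then stop the whole workflow. Here's a hedged sketch of one way to soften that. The names safe_summary and summarise are made up for illustration, and max_chars simply mirrors the summary_size of 1000 we set earlier.

```python
def safe_summary(url, summarise, max_chars=1000):
    """Return (summary, error_message); the element that doesn't apply is None."""
    if not url:
        return None, "No URL supplied"
    try:
        # `summarise` is any callable that turns a URL into summary text.
        summary = summarise(url) or ""
    except Exception as exc:  # deliberately broad: any failure becomes a message
        return None, "Could not summarise {}: {}".format(url, exc)
    # Trim to the size of the output field.
    return summary[:max_chars], None


# Example with a stand-in summariser that always fails:
def always_fails(url):
    raise ValueError("page not found")


print(safe_summary("https://example.com/missing", always_fails))
```

Inside the tool, the error message could be reported as a warning through output_message() while the summary field is left null for that row, so the remaining records still flow through.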

    def ii_update_progress(self, d_percent: float):
        """
        Called by the upstream tool to report what percentage of records have been pushed.
        :param d_percent: Value between 0.0 and 1.0.
        """
 
        self.parent.alteryx_engine.output_tool_progress(self.parent.n_tool_id, d_percent)  # Inform the Alteryx engine of the tool's progress.
        self.parent.output_anchor.update_progress(d_percent)  # Inform the downstream tool of this tool's progress.
 
    def ii_close(self):
        """
        Called when the incoming connection has finished passing all of its records.
        """
 
        self.parent.output_anchor.output_record_count(True)  # True: Let Alteryx engine know that all records have been sent downstream.
        self.parent.output_anchor.close()  # Close outgoing connections.

These final two functions (ii_update_progress() and ii_close()) are housekeeping functions that haven’t been altered from the GitHub template.

Configuring Virtual Environments for Easy Distribution

Since version 2018.1.4 of Alteryx, there’s been a small change to how Python code can be distributed between users who want to share these types of tools, and it’s a two-step process.

Firstly, create a virtual environment for Python using the following command (this may require Admin access in order to write to the ProgramData folder):

C:\Program Files\Alteryx\bin\Miniconda3>python -m venv C:\ProgramData\Alteryx\Tools\Article

Then, we install the necessary modules into this virtual environment:

C:\ProgramData\Alteryx\Tools\Article\Scripts>pip install nltk
C:\ProgramData\Alteryx\Tools\Article\Scripts>pip install newspaper3k

(The second of these commands will also install a whole slew of supporting libraries)

Next, we’ll list out all the modules in this virtual environment and capture them in a requirements.txt file (which will be used by the Python SDK to replicate this setup for any additional users):

C:\ProgramData\Alteryx\Tools\Article\Scripts>pip freeze > ..\requirements.txt

Copy this requirements.txt file into your Article folder and it should look something like this:

beautifulsoup4==4.6.0
certifi==2018.4.16
chardet==3.0.4
cssselect==1.0.3
feedfinder2==0.0.4
feedparser==5.2.1
idna==2.6
jieba3k==0.35.1
lxml==4.2.1
newspaper3k==0.2.6
nltk==3.2.5
Pillow==5.1.0
python-dateutil==2.7.2
PyYAML==3.12
requests==2.18.4
requests-file==1.4.3
six==1.11.0
tldextract==2.2.0
urllib3==1.22
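
Before shipping the file, a quick check that the two libraries our engine script imports really made it into the list can save a confused colleague later. This is a rough sketch only: parse_requirements is an illustrative helper of mine, it understands exact name==version pins and nothing fancier, and the sample text is a shortened version of the list above.

```python
def parse_requirements(text: str) -> dict:
    """Map package name -> pinned version for each 'name==version' line."""
    pins = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        name, sep, version = line.partition("==")
        if sep:  # skip anything that isn't an exact pin
            pins[name.lower()] = version
    return pins


sample = """newspaper3k==0.2.6
nltk==3.2.5
lxml==4.2.1
"""
pins = parse_requirements(sample)
missing = [pkg for pkg in ("newspaper3k", "nltk") if pkg not in pins]
print(missing)
```

An empty list means both packages are pinned; anything else names what still needs a pip install followed by another pip freeze.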

Final Configuration

In Windows Explorer, navigate one folder above your Article folder and create a file called Config.xml. This will be the master configuration file for your tool installer. Use the following code for this file:

<?xml version="1.0" encoding="UTF-8"?>
<AlteryxJavaScriptPlugin>
   <Properties>
      <MetaInfo>
         <Name>Article Summary</Name>
         <Description>Returns a newline-delimited article summary from a supplied URL.</Description>
         <ToolVersion>1.1</ToolVersion>
         <CategoryName>Laboratory</CategoryName>
         <Author>Nick Jewell</Author>
         <Icon>Article\Article_Icon.png</Icon>
      </MetaInfo>
   </Properties>
</AlteryxJavaScriptPlugin>

Change the element values (name, description, author and so on) as needed, and save. You should now have a folder structure that looks like this:

Config.xml
Article
    ArticleConfig.xml
    Article_Engine.py
    Article_GUI.html
    Article_Icon.png
    requirements.txt

The only constraint around naming that I’ve found is that the ArticleConfig.xml file must be named consistently with the parent directory and must include the word Config without any spaces. So, a parent directory called ‘Foo’ should have a config file named FooConfig.xml inside it.
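
If you build more than one of these tools, a tiny pre-flight check on the folder can catch a misnamed file before you package anything. The sketch below assumes the file names used in this post; only the Config naming is a constraint I've actually run into, the rest are simply my own conventions, and missing_tool_files is a made-up helper name.

```python
from pathlib import Path


def missing_tool_files(tool_dir) -> list:
    """List the expected files that are absent from a tool folder."""
    tool_dir = Path(tool_dir)
    name = tool_dir.name  # e.g. 'Article'
    expected = [
        "{}Config.xml".format(name),  # must match the folder name, no spaces
        "{}_Engine.py".format(name),
        "{}_GUI.html".format(name),
        "{}_Icon.png".format(name),
        "requirements.txt",
    ]
    return [f for f in expected if not (tool_dir / f).is_file()]


# Usage, from the folder that contains Article:
# print(missing_tool_files("Article"))
```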

Zipping to the Finish Line

Zip the Article directory and Config.xml files into a zip file called ‘Article.zip’, and then use the command line to rename the .zip extension to .yxi (Alteryx installer file type) as follows:

move Article.zip Article.yxi
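
Since a .yxi is just a renamed zip, the zip-and-rename step can also be scripted, which is handy once you're rebuilding the installer for the tenth time. Here's a minimal sketch using Python's standard zipfile module. The build_yxi function is my own illustration rather than an Alteryx utility, and it assumes the folder layout shown earlier.

```python
import zipfile
from pathlib import Path


def build_yxi(root_dir, tool_name: str, out_path=None) -> Path:
    """Zip <root_dir>/Config.xml and <root_dir>/<tool_name>/ into <tool_name>.yxi."""
    root = Path(root_dir)
    out = Path(out_path) if out_path else root / "{}.yxi".format(tool_name)
    with zipfile.ZipFile(out, "w", zipfile.ZIP_DEFLATED) as archive:
        archive.write(root / "Config.xml", "Config.xml")
        for path in sorted((root / tool_name).rglob("*")):
            if path.is_file():
                # Store paths relative to root so the archive mirrors the folder layout.
                archive.write(path, path.relative_to(root).as_posix())
    return out


# Usage, from the folder that holds Config.xml and the Article directory:
# build_yxi(".", "Article")
```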

You should see the icon change in Windows Explorer from a zipped folder to an Alteryx installer.

Running the Installer

If you double-click the installer, you’ll be requested to take action inside Alteryx Designer. The dialog box will look something like this:

Click Install, navigate to the Laboratory tab and drop the tool into a workflow to begin testing.

Testing the Tool in a Workflow

As part of testing the tool, create a simple workflow that uses a test URL and check that it returns results correctly – drop a text-to-columns tool after the custom tool to split based on the newline (\n) delimiter into rows for easy viewing:

With this input data, I receive the following output from my new tool:

(i.e., exactly the same as I get from my Jupyter notebook.) However, if there are any errors, you should receive reasonably good error messages from the Python SDK including which line of code is throwing the error.

Wow! What just happened?

In this rollercoaster tour of the Python SDK, we took a challenge to improve our text analytics tooling in the simplest way possible – we simplified all the steps to produce a new Alteryx tool to a bare minimum but introduced:

How to script a plugin GUI (so that a user can interact with your tool)
How to configure the tool’s internal files
How to code (and understand) a minimal Python SDK script
How to ensure portability with a virtual environment/requirements file
How to bring it all together and install!

Please let me know via the Comments section if any steps in this process aren’t clear, or if you’re finding errors. Otherwise, I wish you all happy trails with this great new functionality!

A huge thanks to Neil Ryan and the wider Alteryx Developer Community (developers.alteryx.com) for giving me the support I needed to be successful!