
Data Science

Machine learning & data science for beginners and experts alike.
JeffA
Alteryx

Loved by analysts, data scientists, and software engineers, Jupyter notebooks are wonderfully interactive and portable. Their clever interface helps us visualize analytics, run machine learning models, and explore new possibilities. Hats off to Jupyter. It's made a lot of people's jobs easier.

 

Once you have a good Jupyter notebook (or Python script, for that matter), what do you typically do with it? How easy is it to integrate into a broader process? An interactive process? For that matter, how easy is it to integrate any Python script into a broader process? It's time to squeeze more value out of these powerful tools.

 

That's the job of the Jupyter Flow tool. This tool is available for Alteryx 2020.4 and later in the Laboratory. Download Jupyter Flow here.

 

Note: Check out the Jupyter Flow Basics Guide for step-by-step instructions on using the tool from scratch. For a more streamlined introduction to the tool, check out the Jupyter Flow Help Docs. Got a question? It might already have an answer in the Jupyter Flow FAQ.

 

Capabilities

The Jupyter Flow tool allows you to:

  • Run Jupyter notebooks written in a number of languages (Python, Julia, etc.)
  • Pass data into and out of Python-based Jupyter notebooks
  • Customize the Python environments in which your notebooks run

Additionally, workflows containing Jupyter Flow tools can be exported, environment and all. This makes them:

  • Shareable
  • Server compatible

 

What can I do with it?

  • Schedule Python work with dynamic I/O
  • Orchestrate multiple notebooks together
  • Manage integrated notebooks without Alteryx Designer
  • Store integrated notebooks/environments on a shared network drive
  • Create Python-powered analytic apps

What uses can you think of?

 

To the Moon...err Europa!

Two philosophies drive the Jupyter Flow tool:

1. Let users leverage Jupyter in all its glory

2. Simplify environment packaging and sharing

 

Similarly, the two things you need to run the Jupyter Flow tool are:

1. A Jupyter notebook

2. A site-packages folder

 

JeffA_0-1624999755252.png

Two inputs required to run a Jupyter Flow tool. Shown in the config pane.

 

After the first run (which, depending on the size of your site-packages folder, can be a wait - kind of like a trip to Jupiter), you will be gifted with a .pyz file. This is your ticket to sharing this workflow, running it on Server, and sharing environments (on a network perhaps?).
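The `.pyz` extension is Python's zip application format (PEP 441). The generic mechanics, independent of the specific environment archive Jupyter Flow builds, can be sketched with the standard-library `zipapp` module:

```python
import pathlib
import subprocess
import sys
import tempfile
import zipapp

# Build a toy zip application to show what a .pyz file is.
# This demonstrates the general format (PEP 441), not the
# specific environment archive that Jupyter Flow produces.
src = pathlib.Path(tempfile.mkdtemp()) / "app"
src.mkdir()
(src / "__main__.py").write_text("print('hello from pyz')")

target = src.parent / "app.pyz"
zipapp.create_archive(src, target)

# A .pyz runs directly with the Python interpreter.
result = subprocess.run([sys.executable, str(target)],
                        capture_output=True, text=True)
print(result.stdout.strip())  # hello from pyz
```

A single archive file carries the whole application, which is the same property that makes Jupyter Flow's environment file easy to share across machines and with Server.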

 

JeffA_1-1625000162363.png

The .pyz file which unlocks sharing, Server runs, and more.

 

Switch off the package watcher if you'd like to share your workflow or run it on Server.

 

JeffA_0-1625005360857.png

The packages toggle, which enables/disables environment building.

 

Now with a quick trip to the Export Workflow dialog, we're ready to share this workflow with anyone who also has Jupyter Flow. And running the workflow on Server is as simple as working with any other workflow in Gallery.

 

Where's the Data?

Surely you'll want to pass Alteryx data into and out of your Jupyter notebook.

 

Jupyter Flow tries to noninvasively work with your notebooks. So reading data from or writing data to Alteryx involves the use of comments in the form of input and output tags.

 

There are four possible tags:

#ayx_input
#ayx_output
#ayx_input=
#ayx_output=

 

These tags are placed inside your notebook, above the data frame(s) you would like to replace with Alteryx data or output to Alteryx data. These tags do nothing when you're running your notebook outside of Alteryx. When Alteryx runs the notebook, however, it picks up on these tags and modifies the code to pass data in/out.

 

For example, you may have the following code in your Jupyter notebook as well as the following Jupyter Flow tool with available input connections #1 and #2:

lung_cancer_images = get_lung_cancer_images_dataframe()

JeffA_0-1625011747514.png

 

In order to assign `lung_cancer_images` to an incoming Alteryx data stream, you could do the following:

#ayx_input=#2
lung_cancer_images = get_lung_cancer_images_dataframe()

 

Now when Alteryx runs the Jupyter Flow tool, `lung_cancer_images` will be set to the data coming in on connection #2 in the workflow shown above, instead of the result of `get_lung_cancer_images_dataframe()`. Alteryx generates and runs a `_post_processed` version of the notebook (the version actually wired to Alteryx data) and prints its path in the workflow messages. You can open and debug this notebook (see instructions under Advanced Options or the help docs).
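Conceptually, the substitution works like the sketch below. This is illustrative only: the actual code Jupyter Flow writes into the `_post_processed` notebook may differ, and the cache path and CSV format here are made up to keep the sketch runnable.

```python
import tempfile
from pathlib import Path

import pandas as pd

# Stand-in for the data Alteryx would cache for input connection #2
# (hypothetical location and format, used only to make this runnable).
cache = Path(tempfile.gettempdir()) / "ayx_cache_input_2.csv"
pd.DataFrame({"image_id": [101, 102],
              "path": ["scan_a.png", "scan_b.png"]}).to_csv(cache, index=False)

# Original tagged cell:
#   #ayx_input=#2
#   lung_cancer_images = get_lung_cancer_images_dataframe()
#
# Post-processed equivalent: the tagged assignment is replaced with
# a read of the data arriving on connection #2.
lung_cancer_images = pd.read_csv(cache)
print(lung_cancer_images.shape)  # (2, 2)
```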

 

A similar approach applies to all ayx tags. Check out more details in the Jupyter Flow Help Docs.
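Writing results back out looks like the sketch below, assuming the output tag mirrors the input form shown above (`compute_scores` is a hypothetical helper, and `#1` stands for the tool's output connection):

```python
import pandas as pd

def compute_scores():
    # Hypothetical stand-in for real notebook work.
    return pd.DataFrame({"image_id": [101, 102],
                         "malignancy_score": [0.12, 0.87]})

# Outside Alteryx this tag is just a comment and the cell runs
# normally; when Alteryx runs the notebook, the DataFrame assigned
# below is emitted on the Jupyter Flow tool's output connection #1.
#ayx_output=#1
scores = compute_scores()
```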

 

Advanced Options

But wait, there's more! Jupyter Flow also helps you with:

  • Managing your Jupyter Flow generated environments
  • Debugging support - run through your notebook line by line using Alteryx data from a previous run
  • Data cache location configuration (for security or data management needs)

 

Manage Environments

You may enable custom zip app (.pyz file; the environment file) paths. This allows you to generate your .pyz environments once, save them on a network drive or other shared location, and force the tool to use that environment. Find this option under the Advanced accordion:

 

JeffA_1-1625009465065.png

 

Debug Notebooks

When Jupyter Flow runs your notebooks, it modifies a copy of them in the same directory as the notebook. The copy will have `_post_processed` appended to its name. If you open this notebook, you can see what Jupyter Flow has done to enable data to flow into and out of the notebook. You can also run the notebook. However, for performance and security reasons, Jupyter Flow deletes its data caches by default. So to run these `_post_processed` notebooks, enable data cache backup under advanced options:

 

JeffA_2-1625009747512.png

 

Configure Data Cache Location

For security and performance reasons, Jupyter Flow automatically deletes all of its data caches after each run. However, if data cache backup is enabled (for debugging or other purposes), these caches will stick around. When Jupyter Flow runs, the messages section of Designer will inform you of where this data is being cached. If you do not like its chosen cache location (the system's temp directory), you may change this by setting the data cache location to "custom" and typing a path to your desired location:

 

JeffA_3-1625009913958.png

 

Integrate your notebooks with Alteryx using the Jupyter Flow tool and let us know what you think!

 

 

Banner image by Beate Bachman

 

Comments
MarqueeCrew
19 - Altair

@JeffA ,

 

Let's face it Jeff, I'm an old dog.  I learned with Basic, and took lessons in Cobol and C.  Through osmosis I learned mainframe assembler (remember nothing) and was thrilled when I got to play with a toy named Alteryx.  I'm Base-A.  I avoid anything that resembles coding.  That being said, I did play with R and Python when they were introduced in Alteryx.  While I am a proficient RegEx user in Alteryx, I try not to use it when I'm in the presence of newbies (without caution).

 

I'd love to see a collection of "stuff you need python for".  The first time I used python I was experimenting with SHA (Secure Hash Algorithms) encryption.  Let's see some examples posted where functions that are easily consumed can be found in python and implemented in Alteryx.  They, like CReW macros, serve a specific purpose and offer training opportunities.  pandas.get_dummies and sha-### encryption would be nice to have access to.

 

Grow the use of python by demonstrating how easy it is to find ("Google") a function and make it ALTERYX-ready in a matter of minutes.  Heck, write a macro (in Alteryx) that does it for you. That would be bonus points!!!

 

Cheers,

 

Mark

AJacobson
Alteryx

I believe this will really help people share their custom code and embed it in workflows.  Great stuff Jeff!