JeffA
Alteryx Alumni (Retired)

Loved by analysts, data scientists, and software engineers, Jupyter notebooks are wonderfully interactive and portable. Their clever interface helps us visualize analytics, run machine learning models, and explore new possibilities. Hats off to Jupyter. It's made a lot of people's jobs easier.

 

Once you have a good Jupyter notebook (or Python script for that matter), what do you typically do with it? How easy is it to integrate into a broader process? An interactive process? For that matter, how easy is it to integrate any Python script into a broader process? It's time to squeeze more value out of these powerful tools.

 

That's the job of the Jupyter Flow tool. This tool is available for Alteryx 2020.4 and later in the Laboratory. Download Jupyter Flow here.

 

Note: Check out the Jupyter Flow Basics Guide for step-by-step instructions on using the tool from scratch. For a more streamlined introduction to the tool, check out the Jupyter Flow Help Docs. Got a question? It might already have an answer in the Jupyter Flow FAQ.

 

Capabilities

The Jupyter Flow tool allows you to:

  • Run Jupyter notebooks written in a number of languages (Python, Julia, etc.)
  • Pass data into and out of Python-based Jupyter notebooks
  • Customize the Python environments in which your notebooks run

Additionally, workflows containing Jupyter Flow tools can be exported, environment and all. This makes them:

  • Shareable
  • Server compatible

 

What can I do with it?

  • Schedule Python work with dynamic I/O
  • Orchestrate multiple notebooks together
  • Manage integrated notebooks without Alteryx Designer
  • Store integrated notebooks/environments on a shared network drive
  • Create Python-powered analytic apps

What uses can you think of?

 

To the Moon...err Europa!

Two philosophies drive the Jupyter Flow tool:

1. Let users leverage Jupyter in all its glory

2. Simplify environment packaging and sharing

 

Similarly, the two things you need to run the Jupyter Flow tool are:

1. A Jupyter notebook

2. A site-packages folder

 

JeffA_0-1624999755252.png

Two inputs required to run a Jupyter Flow tool. Shown in the config pane.

 

After the first run (which, depending on the size of your site-packages folder, can be a wait - kind of like a trip to Jupiter), you will be gifted with a .pyz file. This is your ticket to sharing this workflow, running it on Server, and sharing environments (on a network perhaps?).

 

JeffA_1-1625000162363.png

The .pyz file which unlocks sharing, Server runs, and more.

 

Switch off the package watcher if you'd like to share your workflow or run it on Server.

 

JeffA_0-1625005360857.png

The packages toggle, which enables/disables environment building.

 

Now, with a quick trip to the Export Workflow dialog, we're ready to share this workflow with anyone who also has Jupyter Flow. And running the workflow on Server is as simple as working with any other workflow in Gallery.

 

Where's the Data?

Surely you'll want to pass Alteryx data into and out of your Jupyter notebook.

 

Jupyter Flow tries to work with your notebooks noninvasively. So reading data from or writing data to Alteryx relies on comments in the form of input and output tags.

 

There are four possible tags:

#ayx_input
#ayx_output
#ayx_input=
#ayx_output=

 

These tags are placed inside your notebook, above the data frame(s) you would like to replace with Alteryx data or output to Alteryx data. These tags do nothing when you're running your notebook outside of Alteryx. When Alteryx runs the notebook, however, it picks up on these tags and modifies the code to pass data in/out.

 

For example, I may have the following code in my Jupyter notebook, along with the following Jupyter Flow tool, which has input connections #1 and #2 available:

lung_cancer_images = get_lung_cancer_images_dataframe()

JeffA_0-1625011747514.png

 

In order to assign `lung_cancer_images` to an incoming Alteryx data stream, I could do the following:

#ayx_input=#2
lung_cancer_images = get_lung_cancer_images_dataframe()

 

Now when Alteryx runs the Jupyter Flow tool, `lung_cancer_images` will be set to the data coming in on connection #2 in the workflow shown above, instead of the result of `get_lung_cancer_images_dataframe()`. To do this, Alteryx generates and runs a `_post_processed` version of the notebook, which is the version connected to Alteryx, and reports the path to that notebook in the workflow messages. You can open and debug this notebook (see the instructions under Advanced Options or the help docs).

 

A similar approach applies to all ayx tags. Check out more details in the Jupyter Flow Help Docs.
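
To make the tag idea concrete, here is a minimal sketch of how a tool could locate ayx tags in a notebook file. It is not Jupyter Flow's actual implementation: `find_ayx_tags` and `TAG_PATTERN` are hypothetical names, and the only things taken from this post are the four tag forms and the rule that a tag sits on the line above the data frame it applies to. It relies on the fact that a notebook file is JSON with a list of cells.

```python
import json
import re

# Matches the four tag forms named in this post: #ayx_input, #ayx_output,
# and the "=" variants such as "#ayx_input=#2".
TAG_PATTERN = re.compile(r"^\s*#ayx_(input|output)(?:=(.*?))?\s*$")


def find_ayx_tags(notebook_json: str):
    """Return (cell_index, direction, value, tagged_line) for every tag found.

    Hypothetical helper for illustration only.
    """
    notebook = json.loads(notebook_json)
    found = []
    for cell_index, cell in enumerate(notebook.get("cells", [])):
        if cell.get("cell_type") != "code":
            continue  # tags only make sense in code cells
        source = cell.get("source", [])
        # The notebook format stores source as one string or a list of lines.
        if isinstance(source, str):
            lines = source.splitlines()
        else:
            lines = [line.rstrip("\n") for line in source]
        for i, line in enumerate(lines):
            match = TAG_PATTERN.match(line)
            if match:
                # The tag applies to the line directly below it.
                tagged_line = lines[i + 1] if i + 1 < len(lines) else ""
                found.append((cell_index, match.group(1), match.group(2), tagged_line))
    return found


# A tiny in-memory notebook that mirrors the example above.
demo = json.dumps({
    "cells": [
        {"cell_type": "markdown", "source": ["# Notes"]},
        {"cell_type": "code", "source": [
            "#ayx_input=#2\n",
            "lung_cancer_images = get_lung_cancer_images_dataframe()\n",
        ]},
    ]
})

for tag in find_ayx_tags(demo):
    print(tag)
```

Because the tags are ordinary Python comments, a scan like this never changes how the notebook behaves when you run it outside of Alteryx.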

 

Advanced Options

But wait, there's more! Jupyter Flow also helps you with:

  • Managing your Jupyter Flow generated environments
  • Debugging support - run through your notebook line by line using Alteryx data from a previous run
  • Data cache location configuration (for security or data management needs)

 

Manage Environments

You may enable custom zip app (.pyz file; the environment file) paths. This allows you to generate your .pyz environments one time, save them on a network drive or other shared location, and force the tool to use that environment. Find this option under the Advanced accordion:

 

JeffA_1-1625009465065.png
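
If the term "zip app" is new to you, the sketch below shows the general idea using Python's standard `zipapp` module: a folder of Python code is packed into a single .pyz file that the interpreter can run directly. This only illustrates the file format. How Jupyter Flow actually builds its .pyz environments may well differ, and the names `build_demo_pyz`, `run_pyz`, `demo_app`, and `demo_env.pyz` are made up for this example.

```python
import subprocess
import sys
import tempfile
import zipapp
from pathlib import Path


def build_demo_pyz(work_dir: Path) -> Path:
    """Package a one-file app into a .pyz archive and return its path."""
    app_dir = work_dir / "demo_app"
    app_dir.mkdir()
    # A zip app needs a __main__.py entry point at the archive root.
    (app_dir / "__main__.py").write_text('print("hello from a zip app")\n')
    target = work_dir / "demo_env.pyz"
    zipapp.create_archive(app_dir, target=target)
    return target


def run_pyz(pyz_path: Path) -> str:
    """Run the archive with the current interpreter and return its stdout."""
    result = subprocess.run(
        [sys.executable, str(pyz_path)],
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout.strip()


with tempfile.TemporaryDirectory() as tmp:
    pyz = build_demo_pyz(Path(tmp))
    print(pyz.name, "->", run_pyz(pyz))
```

Since the result is a single file, it can be copied to a network drive or other shared location like any other file, which is what makes the custom path option above practical.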

 

Debug Notebooks

When Jupyter Flow runs your notebooks, it modifies a copy of them in the same directory as the original notebook. The copy will have `_post_processed` appended to its name. If you open this notebook, you can see what Jupyter Flow has done to enable data to flow into and out of the notebook. You can also run the notebook. However, for performance and security reasons, Jupyter Flow deletes its data caches by default. So to run these `_post_processed` notebooks, enable data cache backup under advanced options:

 

JeffA_2-1625009747512.png
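
If you want to find the copy from a script, a small helper along these lines may do. It is only a sketch: `post_processed_path` is a hypothetical name, and it assumes the suffix is inserted before the `.ipynb` extension. The path printed in the workflow messages is the one to trust.

```python
from pathlib import Path

SUFFIX = "_post_processed"


def post_processed_path(notebook_path: str) -> Path:
    """Guess where the post-processed copy of a notebook is written.

    Assumptions: the copy sits next to the original, and the suffix
    goes between the file name and its extension.
    """
    original = Path(notebook_path)
    return original.with_name(f"{original.stem}{SUFFIX}{original.suffix}")


print(post_processed_path("reports/model.ipynb"))
```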

 

Configure Data Cache Location

For security and performance reasons, Jupyter Flow automatically deletes all of its data caches after each run. However, if data cache backup is enabled (for debugging or other purposes), these caches will stick around. When Jupyter Flow runs, the messages section of Designer will inform you of where this data is being cached. If you do not like its chosen cache location (the system's temp directory), you may change this by setting the data cache location to "custom" and typing a path to your desired location:

 

JeffA_3-1625009913958.png
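
Not sure where your system's temp directory is, or whether a custom location will work? The sketch below uses Python's standard library to print the temp directory and to check that a candidate folder exists and is writable. The helper names are hypothetical, and the exact folder Jupyter Flow creates inside the temp directory is not covered here, so rely on the Designer messages for the real cache path.

```python
import os
import tempfile
from pathlib import Path


def default_cache_root() -> Path:
    """Return the temp directory as Python resolves it on this machine."""
    return Path(tempfile.gettempdir())


def is_usable_cache_dir(path: str) -> bool:
    """Check that a candidate cache location exists and can be written to."""
    candidate = Path(path)
    return candidate.is_dir() and os.access(candidate, os.W_OK)


print("System temp directory:", default_cache_root())
print("Writable:", is_usable_cache_dir(str(default_cache_root())))
```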

 

Integrate your notebooks with Alteryx using the Jupyter Flow tool and let us know what you think!

 

 

Banner image by Beate Bachman

 
