Alteryx Designer Discussions

SOLVED

Creating a KNN imputer for Alteryx using Python and Sci-Kit Learn

amir_alteryx_2021
6 - Meteoroid

Hello everyone,

 

I am in a bit of a dilemma. First, I am new to the Python tool, and second, I am not sure how it works beyond looking like a normal Jupyter notebook.

 

My objective is simply to impute missing data using the prebuilt KNN imputer from scikit-learn, described at the link below. This would be an alternative to imputation via mean, mode, or median. I have had positive results using this:

https://machinelearningmastery.com/knn-imputation-for-missing-values-in-machine-learning/

 

So here is what I can do in Jupyter:

  • Load a data frame from a CSV (using the Titanic dataset).
  • Run a function that tries a range of odd neighbor counts, collects the results in a pandas data frame, and takes the average of the Root Mean Squared Error. That average is a float, so we round it up to the nearest whole number, use that as our best choice for the number of nearest neighbors, and proceed to impute.
  • Once imputed, the NaNs are replaced and we can merge the result back with the rest of the dataset, which holds all the non-numerical categorical data.
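
For readers without the attached notebook, the Jupyter steps above can be sketched roughly as follows. This is a minimal sketch, not the notebook itself: it uses scikit-learn's KNNImputer on a small synthetic table with Titanic-style column names. The data, the helper names rmse_for_k and knn_impute, the 20% hold-out, and the candidate k values are all assumptions for illustration. It also picks the k with the lowest RMSE, which is one common way to choose k and may differ from how the notebook derives its number.

```python
import numpy as np
import pandas as pd
from sklearn.impute import KNNImputer


def rmse_for_k(numeric, column, k, holdout_frac=0.2, seed=0):
    """Hide some known values in `column`, impute them with k neighbors,
    and return the RMSE against the values that were hidden."""
    rng = np.random.default_rng(seed)
    known = numeric.index[numeric[column].notna()].to_numpy()
    n_hidden = max(1, int(len(known) * holdout_frac))
    hidden = rng.choice(known, size=n_hidden, replace=False)

    masked = numeric.copy()
    masked.loc[hidden, column] = np.nan

    filled = KNNImputer(n_neighbors=k).fit_transform(masked)
    guess = pd.Series(filled[:, masked.columns.get_loc(column)], index=masked.index)
    errors = guess.loc[hidden] - numeric.loc[hidden, column]
    return float(np.sqrt((errors ** 2).mean()))


def knn_impute(df, numeric_cols, target, candidate_ks=(1, 3, 5, 7, 9)):
    """Score each odd k, keep the one with the lowest RMSE, then impute.
    Non-numeric columns are carried through untouched."""
    numeric = df[numeric_cols].astype(float)
    scores = pd.DataFrame({
        "k": list(candidate_ks),
        "rmse": [rmse_for_k(numeric, target, k) for k in candidate_ks],
    })
    best_k = int(scores.loc[scores["rmse"].idxmin(), "k"])

    filled = KNNImputer(n_neighbors=best_k).fit_transform(numeric)
    out = df.copy()
    out[numeric_cols] = pd.DataFrame(filled, columns=numeric_cols, index=df.index)
    return out, best_k, scores


# Synthetic stand-in for the Titanic CSV (made-up values, assumed column names).
rng = np.random.default_rng(42)
n = 60
demo = pd.DataFrame({
    "PassengerId": np.arange(1, n + 1),
    "Sex": rng.choice(["male", "female"], size=n),
    "Pclass": rng.integers(1, 4, size=n),
    "Fare": rng.uniform(5, 100, size=n).round(2),
    "Age": rng.uniform(1, 70, size=n).round(0),
})
demo.loc[rng.choice(n, size=12, replace=False), "Age"] = np.nan

result, best_k, scores = knn_impute(demo, ["Pclass", "Fare", "Age"], target="Age")
print(scores)
print("chosen k:", best_k)
```

Because the categorical columns ride along in the same data frame here, no separate merge step is needed; if the numeric columns are split off first, joining back on PassengerId would do the same job.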

What I have gotten done in Alteryx:

  • Modified the script to load the data frame from the incoming Alteryx data
  • Changed the data types and field selection
  • Created the final data frame and wrote it to the output

So far no cigar. It seems that the notebook is not actually re-run every time I run the workflow, though I am not sure of that. Ultimately, I would like to turn this into something that could be used the same way as the Impute tool. I have included the Jupyter notebook and the workflow.

 

Any help getting this to work would be appreciated. The alternatives are to not use Alteryx for data prep at all, or to pre-rinse the data with Python before loading it, both of which seem to defeat the spirit of a one-source solution. Am I going about this wrong? Should I be trying to build this out in the Python SDK?

 

**Also, why does the Alteryx notebook keep failing to save the library addition to the code? Maybe I am missing something, but it seems like it does not save changes.

 

If you have issues with the code, you might need to add the following line to the imports:

from ayx import Alteryx

3 REPLIES
DataCurious_Nick
6 - Meteoroid

Hi Amir,

 

Your workflow is looking pretty good from my side - all that I've added is: 

 

from ayx import Alteryx  (as you suggested; this allows the Python notebook to pick up the Alteryx-specific libraries for reading/writing datasets)

 

and then added a Browse at the end of the workflow after output node #1 (so that the data frame streams back into the main Alteryx workflow). 

 

Now, every time you run the workflow, the notebook is executed in sequence and runs the KNN and imputes missing ages, before sending these results back to the main Designer workflow. From here, it's simple to blend back the original fields that you excluded in your Select tools using a Join on the Passenger ID, or to continue to develop no-code/low-code models using the Predictive Suite or Intelligence Suite (depending on what you have installed). 
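
For reference, here is a rough sketch of what the Python tool cell could look like. Alteryx.read and Alteryx.write are the ayx package calls for pulling in an incoming connection and pushing a data frame to an output anchor. The connection name "#1", output anchor 1, the FEATURE_COLS list, the fixed n_neighbors, and the impute_numeric helper are assumptions for illustration, not the contents of the attached workflow. The import is guarded so the same cell also runs in plain Jupyter, where it falls back to a tiny made-up sample.

```python
import numpy as np
import pandas as pd
from sklearn.impute import KNNImputer

try:
    from ayx import Alteryx  # present only inside Alteryx Designer's Python tool
except ImportError:
    Alteryx = None  # lets the same cell run in plain Jupyter


def impute_numeric(df, feature_cols, n_neighbors=5):
    """Return a copy of df with NaNs in feature_cols filled by KNN imputation.
    Assumes no column in feature_cols is entirely missing."""
    out = df.copy()
    features = out[feature_cols].astype(float)
    filled = KNNImputer(n_neighbors=n_neighbors).fit_transform(features)
    out[feature_cols] = pd.DataFrame(filled, columns=feature_cols, index=out.index)
    return out


# Hypothetical field names; use whatever the upstream Select tool passes in.
FEATURE_COLS = ["Age", "Fare"]

if Alteryx is not None:
    incoming = Alteryx.read("#1")  # assumed name of the incoming connection
    Alteryx.write(impute_numeric(incoming, FEATURE_COLS), 1)  # output anchor 1
else:
    # Made-up rows so the cell can be tried outside Designer.
    sample = pd.DataFrame({
        "PassengerId": [1, 2, 3, 4, 5, 6],
        "Age": [22.0, np.nan, 35.0, 54.0, np.nan, 27.0],
        "Fare": [7.25, 8.05, 53.10, 51.86, 8.46, 11.13],
    })
    print(impute_numeric(sample, FEATURE_COLS, n_neighbors=2))
```

Keeping PassengerId out of FEATURE_COLS means it passes through unchanged, so it stays usable as the Join key afterwards.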

 

Hope this answers your question - I've included the updated workflow in the reply. 

 

Cheers,

Nick

amir_alteryx_2021
6 - Meteoroid

Thanks Nick!

 

On your last point, how would I take the Python function that is in the Alteryx notebook and convert it into something like the Impute tool?

That way it would truly be drag and drop. Am I asking or trying to do too much?

Alternatively, why does Alteryx not have this built-in?  

DataCurious_Nick
6 - Meteoroid

No problem, Amir!

 

I think you're thinking along the right lines - once you've got some working Python code that you want to make more 'repeatable' inside an Alteryx workflow, you've got two main options:

- Wrap the Python code inside a new tool created with the Python SDK

(this way, it truly becomes a 'tool' for users and all the complexity is hidden away)

There are a few pros and cons with this: building with the Python SDK takes a bit of work, but it can produce robust tools that can be distributed across your org.

 

- Wrap the Python tool inside a macro and then save/distribute the macro as needed.

(this way, you get a new no-code tool that performs all the functions of the original workflow)

Again: pros and cons. Much, much quicker to get started. You *can* encrypt the macro so that the inner workings aren't visible to end users, but by default a macro can be opened and inspected by end users (right click, open macro, etc.)

 

I'm attaching an example of your KNN Imputation logic inside a macro so you can see what I mean. In this workflow we have a minimum of tools: Input > AutoField > MACRO > Browse. 

 

If you right-click on the macro tool in the workflow and select 'Open Macro' you'll see the tools I've used to pass the information down to the Python logic, and then pass the results back up to the main workflow. (If this is new to you: Alteryx Community has some good macro learning content, and it's really not too hard to get started.)

 

Also, a caveat: this isn't bulletproof, just a proof-of-concept for you to get started! 😎

 

Hope this helps!
Nick
