community
cancel
Showing results for 
Search instead for 
Did you mean: 

Alteryx Knowledge Base

Definitive answers from Designer experts.
Upgrade Alteryx Designer in 10 Steps

Debating whether or not to upgrade to the latest version of Alteryx Designer?

LEARN MORE

Impute Missing Data Values with a Custom Formula

Sr. Data Science Content Engineer
Sr. Data Science Content Engineer
Created on

In statistics, imputation means replacing missing data with substituted values. This is an important thing to be able to do in many data science tasks, particularly in model building. Many of the Alteryx predictive tools, which are coded in R, will not accept variables with null or missing values. Imputation allows you to include rows or fields of data with null values in these models by simulating a value to replace the nulls.

 

Alteryx does have a built-in Imputation Tool, which will let you do all sorts of handy imputations such as replacing nulls with the field mean, median, mode, or even a constant specified by the user. These options cover the most common imputations a data scientist may need to perform. But what if you have a calculation in mind you wanted to use to replace your null values not included in the basic averages available in the Imputation tool? This requires performing your imputation with a workflow instead of the tool.

 

The good news is that this can easily be accomplished with a few tools from the Transform Category, a Join and a Formula.

 

2017-10-23_14-24-12.png

 

First, use the Summarize Tool to find the average of each of the fields you need to impute values for. Here, select the summary statistic you are interested in (such as mean, median or mode) for each field you are imputing values to. You also have the ability to group the summarized values by another field. This can be handy if you want unique imputations for different groups of data. For example, if you had multiple purchase values for each customer, and wanted to impute values for customers with null values based on the average purchase value for that customer, then you would include a group by customer. This will only work if there are enough valid values in each group (it is impossible to impute a value if there are no values to start with).

 

2017-10-23_14-25-21.png

 

Next, use the Transpose Tool to shift your fields of interest into a single column. Concurrently, you will use another Transpose Tool on your original fields of interest so that all of the values in each of your fields, including nulls, are in a single column. If you are grouping your imputations by an additional field, make sure to select that field as a Key Field in your transpose tools.

 

Then, Join the two data streams together on the Name Field created by the Transpose Tool. This column contains the name of the field each data point in the value field belongs to. If you've grouped your values for imputation, you would also join on your grouping field. This join results in a combined data set where each value has an associated field average with it.

 

 2017-10-23_14-26-44.png

 

Use the Formula Tool to calculate your imputation value and replace nulls in it with a formula like this:

 

2017-10-23_14-27-55.png

 

Where org_value is the field with your original data, and avg_value is the calculated average for each field generated by the Summarize tool. In this step, you could easily modify the replacement value (everything between Then and Else) to factor in a constant offset or other calculations.

 

Finally, use the Cross Tab tool to return your data to its original format.

 

2017-10-23_14-28-51.png

 

And there you have it! A data set with custom imputed values! See the attached v11.7 workflow CustomImputation.yxmd for the above example.

 

Remember, with great power comes great responsibility. Make sure your custom imputations are statistically valid and make sense for your data set. Smiley Happy 

Attachments
Comments
Asteroid

Hi @SydneyF,

 

I am working on a Time Series problem which has missing data for certain combinations of the data. I had initially substituted all the null values in the dataset with 0s. I then fed this dataset into ARIMA and ETS Time Series Tools.

 

The ARIMA and ETS tools produced negative values of the forecast for certain combinations in the dataset. The negative values for forecast for a few combinations drew interest of Data Sciences enthusiasts who suggested that I could impute the missing data based on Trend and Seasonality present in the dataset instead of replacing those values with 0s.

 

Could you provide any thoughts on the considerations that must be factored into the imputation exercise for a Time Series problem.

 

Regards,

Ashish

Sr. Community Content Manager
Sr. Community Content Manager

Check out this article which contains an example workflow that fills missing values with the average of the immediately preceding and proceeding period. You could also experiment with the AB Trend tool which was designed for use with A/B Analysis, but could be applicable to your use case, as well.