Alteryx Designer Desktop Discussions

Predictive tool to replace missing values

tutankamon
7 - Meteor

Hi everyone,

 

I have a dataset with one column that contains missing values, and I would like to replace them with the most appropriate value based on the other columns, since in this case imputing with the median, mean, etc. of the column does not make sense. I had thought of training a supervised model on the rows where the value is known and using it to predict the unknown values, but I don't know how to implement this. The column with missing values is called rooms, and I think we could use all the other columns to try to predict it. I attach the dataset I am using.

 

Thanks in advance!!

SydneyF
Alteryx Alumni (Retired)

Hi @tutankamon,

 

What you are describing is certainly possible. The first step is to train a supervised predictive model to estimate the column with missing values. You can select a model from the Predictive tool palette that is appropriate for the data type of "Rooms". It appears to be integer data, so the Count Regression tool might be a good model to investigate. Note that you may need to sample your data to train a model effectively with this tool.

 

To train the model, use a Filter tool to separate the rows where rooms is missing from the rows where it is known, and configure the model to estimate rooms from the rows with known values. It is best practice to develop multiple models iteratively to find the best fit for your data. Once you have trained a model you are satisfied with, feed the model (the O output anchor in the predictive tools) and the rows with missing values into a Score tool to estimate rooms for those rows. You can then join and union the data back together, and use a Formula tool to replace the blank values with the estimates produced by the Score tool.
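For readers who want to see the same logic outside of Designer, here is a minimal Python sketch of the filter / train / score / union steps. It is not the attached workflow: a simple k-nearest-neighbours vote stands in for the Count Regression tool, and the column names (area_m2, price), the toy rows, and the rooms_imputed flag are all made up for illustration.

```python
from collections import Counter
from math import dist

# Hypothetical toy data; None marks a missing "rooms" value.
rows = [
    {"area_m2": 45.0, "price": 90_000.0, "rooms": 1},
    {"area_m2": 50.0, "price": 100_000.0, "rooms": 1},
    {"area_m2": 80.0, "price": 160_000.0, "rooms": 3},
    {"area_m2": 85.0, "price": 170_000.0, "rooms": 3},
    {"area_m2": 120.0, "price": 250_000.0, "rooms": 4},
    {"area_m2": 48.0, "price": 95_000.0, "rooms": None},
    {"area_m2": 82.0, "price": 165_000.0, "rooms": None},
]
FEATURES = ["area_m2", "price"]  # assumed predictor columns
TARGET = "rooms"


def feature_ranges(train, features):
    """Min and max of each feature in the training rows, used for scaling."""
    return {f: (min(r[f] for r in train), max(r[f] for r in train)) for f in features}


def scale(row, features, ranges):
    """Min-max scale a row's features so no single column dominates the distance."""
    return [
        (row[f] - ranges[f][0]) / ((ranges[f][1] - ranges[f][0]) or 1.0)
        for f in features
    ]


def predict_knn(train, row, features, target, ranges, k=3):
    """Most common target value among the k nearest training rows."""
    point = scale(row, features, ranges)
    nearest = sorted(train, key=lambda r: dist(point, scale(r, features, ranges)))[:k]
    return Counter(r[target] for r in nearest).most_common(1)[0][0]


def impute(rows, features, target, k=3):
    """Train on rows with a known target, score the rest, return all rows together."""
    known = [r for r in rows if r[target] is not None]  # the "Filter" step
    ranges = feature_ranges(known, features)
    result = []
    for r in rows:
        new = dict(r)
        new["rooms_imputed"] = r[target] is None
        if r[target] is None:  # the "Score" + "Formula" steps
            new[target] = predict_knn(known, r, features, target, ranges, k)
        result.append(new)
    return result


filled = impute(rows, FEATURES, TARGET)
for r in filled:
    print(r)
```

Keeping a flag such as rooms_imputed makes it easy to tell estimated values from observed ones later, which helps when you compare several candidate models.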

 


Does this process make sense? Are there any further questions I might be able to assist you with? I did some initial investigation on the data set you posted (with the very useful Field Summary tool), and found that Region and Zip code both have categorical values that occur in only a few records. If categorical values are too granular, they can cause errors with some of the predictive tools (e.g., random forest) and are not helpful to the model for predictions. You may need to do some feature engineering (combining categories, etc.) so that all of the other columns can be used in a predictive model, or else exclude these fields as predictor variables.
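As a rough illustration of "combining categories", the sketch below lumps every category that appears fewer than a chosen number of times into a single "Other" level. The region names, the threshold of 3, and the "Other" label are assumptions for the example, not values taken from the posted data set.

```python
from collections import Counter


def lump_rare(values, min_count, other="Other"):
    """Replace categories seen fewer than min_count times with a shared label.

    None (missing) entries are left untouched.
    """
    counts = Counter(v for v in values if v is not None)
    return [
        v if v is None or counts[v] >= min_count else other
        for v in values
    ]


# Hypothetical Region column: two regions occur only once each.
regions = ["Madrid"] * 4 + ["Barcelona"] * 3 + ["Soria", "Teruel"]
lumped = lump_rare(regions, min_count=3)
print(Counter(lumped))
```

The right threshold depends on the size of the data set and on the model, so it is worth trying a few values and checking how the fit changes.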

 

Another thing I noticed about your data set is that for the Terreno category in Kind of Property, all Room values are Null. This will make it difficult to estimate the number of rooms for this property type if you use property type as a predictor variable.
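A quick way to spot this kind of problem is to compute the share of missing values per category before training. The sketch below assumes rows are dictionaries with a hypothetical "kind" key for Kind of Property; "Piso" and "Chalet" are invented categories, and only "Terreno" comes from the data set discussed here.

```python
from collections import defaultdict


def missing_share_by_group(rows, group_key, target):
    """Fraction of rows with a missing target, per value of group_key."""
    totals = defaultdict(int)
    missing = defaultdict(int)
    for row in rows:
        group = row[group_key]
        totals[group] += 1
        if row[target] is None:
            missing[group] += 1
    return {group: missing[group] / totals[group] for group in totals}


# Hypothetical sample; None marks a missing "rooms" value.
sample = [
    {"kind": "Piso", "rooms": 2},
    {"kind": "Piso", "rooms": None},
    {"kind": "Terreno", "rooms": None},
    {"kind": "Terreno", "rooms": None},
    {"kind": "Chalet", "rooms": 4},
]
shares = missing_share_by_group(sample, "kind", "rooms")
all_missing = [group for group, share in shares.items() if share == 1.0]
print(shares)
print(all_missing)
```

Any category that ends up in all_missing has no known values to learn from, so a model cannot use that category to estimate rooms for those rows.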

 

You may find the Titanic Series on the Data Science blog helpful as well. There is a post that specifically deals with missing values in a data set: Life or Death Missingness in the Titanic Data Set. Data investigation will be your best friend in this process.

 

Thanks!

 

Sydney

tutankamon
7 - Meteor

Hi @SydneyF,

 

Thank you very much for your clear and well-explained answer. I really appreciate how thoroughly you have explained each step of the workflow and the different issues you found in the dataset. Could you attach the workflow so I can see how the tools you used are configured?

 

Thank you very much for your help!

 

Javier

SydneyF
Alteryx Alumni (Retired)

Hi @tutankamon,

 

Please find the workflow attached. I hope this helps demonstrate the process. I do not believe the model is returning reasonable results as it is currently configured. You will more than likely need to tweak the model and/or try other models in order to appropriately replace the missing values.

 

Thanks!

 

Sydney

tutankamon
7 - Meteor

Hi @SydneyF,

 

Thank you very much for all your help, it's been really helpful.

 

Kind regards,

Javier

 

TimothyL
Alteryx Alumni (Retired)

Hi @tutankamon ,

 

We have new missing value imputation macros here: https://community.alteryx.com/t5/Data-Science/Expand-Your-Predictive-Palette-IV-Imputation-Beyond-Me...

 

Based on your post, the missForest one would be a great fit for your trial. Let us know what you think!

 

TL
