Alteryx Designer Ideas

Share your Designer product ideas - we're listening!
Optimal binning tool for predictive analytics

Idea:

Some well-known scoring methods use optimally binned variables for added robustness. Let's add this capability to Alteryx.

 

Rationale:

Here's a basic explanation of why to do this: http://documents.software.dell.com/statistics/textbook/optimal-binning

 

Current status in Alteryx, as far as I'm aware:

The Tile tool, or the Multi-Field Binning tool (which does the same job as the Tile tool on multiple fields), splits variables using these methods:

  • Equal Records or Intervals or Sums

  • Smart Tile

  • Unique Value 

  • Manual

Unfortunately, "equal something" binnings are a bad idea, as the values are categorized "blindly", irrespective of the effect on the predictive power of the models.

 

What to do:

What's needed is to bin both numerical and categorical variables optimally, so that the Weight of Evidence (WoE) across the bins follows a monotonically increasing or decreasing pattern, or at most a V- or U-shaped "convex" structure.
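To make the monotonicity requirement concrete, here is a minimal Python sketch of the WoE calculation and a monotonicity check. It assumes the common convention WoE = ln(share of goods in the bin / share of bads in the bin); some references flip the ratio. The function names and the counts are hypothetical, for illustration only.

```python
import math

def weight_of_evidence(bins):
    """bins: list of (n_good, n_bad) counts per bin, in bin order.

    Assumes every bin holds at least one good and one bad record,
    otherwise the logarithm is undefined.
    """
    total_good = sum(g for g, _ in bins)
    total_bad = sum(b for _, b in bins)
    return [math.log((g / total_good) / (b / total_bad)) for g, b in bins]

def is_monotone(values):
    """True if the sequence never decreases or never increases."""
    pairs = list(zip(values, values[1:]))
    return all(a <= b for a, b in pairs) or all(a >= b for a, b in pairs)

# Hypothetical (good, bad) counts for four ordered bins of one variable.
counts = [(50, 50), (100, 40), (200, 30), (300, 20)]
woe = weight_of_evidence(counts)
print([round(w, 3) for w in woe], is_monotone(woe))
```

A binning would be acceptable under the rule above when `is_monotone(woe)` holds; the V- or U-shaped case would need a separate check.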

 

Quick win:

Without constraining ourselves to the monotone or convex cases, the easiest practice would be to run a C4.5 or CHAID tree algorithm (which produces multi-way splits rather than the binary splits of CART) on a single variable, with the target as the dependent variable. The resulting nodes are the bins we are looking for. Doing this for multiple variables at once is the key feature of the tool to be built.
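As a rough illustration of tree-based binning for one variable, here is a Python sketch. Note that it is a simplified stand-in, not C4.5 or CHAID proper: it recursively applies binary splits chosen by information gain to a single numeric variable against a 0/1 target, with no gain ratio, pruning, or chi-square merging. The function names, parameters, and data are all hypothetical.

```python
import math

def entropy(pos, n):
    """Binary entropy of a node with `pos` positives out of `n` records."""
    if n == 0 or pos == 0 or pos == n:
        return 0.0
    p = pos / n
    return -(p * math.log2(p) + (1 - p) * math.log2(1 - p))

def best_split(pairs, min_leaf):
    """Return (gain, threshold) of the best binary split, or None."""
    n = len(pairs)
    total_pos = sum(y for _, y in pairs)
    parent = entropy(total_pos, n)
    best = None
    left_pos = 0
    for i in range(1, n):
        left_pos += pairs[i - 1][1]  # positives among the first i records
        # Only split between distinct values, keeping both sides big enough.
        if pairs[i - 1][0] == pairs[i][0] or i < min_leaf or n - i < min_leaf:
            continue
        child = (i * entropy(left_pos, i)
                 + (n - i) * entropy(total_pos - left_pos, n - i)) / n
        gain = parent - child
        if best is None or gain > best[0]:
            best = (gain, (pairs[i - 1][0] + pairs[i][0]) / 2)
    return best

def tree_bin_edges(x, y, min_leaf=2, min_gain=0.01, max_depth=3):
    """Sorted cut points for one numeric variable against a 0/1 target."""
    def recurse(pairs, depth):
        if depth == 0:
            return []
        split = best_split(pairs, min_leaf)
        if split is None or split[0] < min_gain:
            return []
        t = split[1]
        left = [p for p in pairs if p[0] <= t]
        right = [p for p in pairs if p[0] > t]
        return recurse(left, depth - 1) + [t] + recurse(right, depth - 1)
    return recurse(sorted(zip(x, y)), max_depth)

# Hypothetical data: the event only occurs in the middle of the range.
x = list(range(1, 13))
y = [0] * 4 + [1] * 4 + [0] * 4
edges = tree_bin_edges(x, y)
print(edges)
```

Running the same function over each candidate predictor in turn would give the "multiple variables at once" behaviour described above; the leaves between consecutive cut points are the bins.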

 

Clients:

This capability is sought by risk management departments building robust, stable, Basel-compliant models in the financial industry, especially by banks.

7 Comments

We would love to see this addition too.

Alteryx Partner

Thanks Jeremy,

 

I kept on reading and here is the better solution:

Taking monotonicity and/or convexity as our constraints, the ideal practice would be to run a constrained optimisation algorithm for each variable. Doing this for multiple variables at once will be the key feature of the tool to be built.
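A full constrained optimisation is beyond a forum post, but the monotonicity constraint itself can be illustrated with a much simpler technique. The Python sketch below uses pool-adjacent-violators merging: starting from fine-grained ordered bins, it merges neighbouring bins until the event rate is monotone. This is a swapped-in heuristic for illustration, not the method from the SAS paper or from smbinning, and the function name and counts are hypothetical.

```python
def merge_to_monotone(bins, increasing=True):
    """Merge adjacent bins until the event rate is monotone.

    bins: list of (n_records, n_events) in order of the variable,
    each with n_records > 0. Returns the merged list of tuples.
    """
    merged = []
    for n, e in bins:
        merged.append([n, e])
        # Pool while the last two bins violate the required ordering.
        while len(merged) > 1:
            (n1, e1), (n2, e2) = merged[-2], merged[-1]
            # Compare e1/n1 with e2/n2 by cross-multiplication.
            if increasing:
                violates = e1 * n2 > e2 * n1
            else:
                violates = e1 * n2 < e2 * n1
            if not violates:
                break
            merged[-2:] = [[n1 + n2, e1 + e2]]
    return [tuple(b) for b in merged]

# Hypothetical starting bins with event rates 5%, 12%, 9%, 20%.
fine_bins = [(100, 5), (100, 12), (100, 9), (100, 20)]
monotone_bins = merge_to_monotone(fine_bins)
print(monotone_bins)
```

Here the second and third bins break the increasing pattern, so they are pooled into one bin of 200 records; the other bins are left as they are.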

 

A similar capability exists in the SAS scoring tool, which seems to be why the SAS tool is the one mostly used by financial institutions.

A relevant solution, again for SAS, is described in the following paper:
http://www2.sas.com/proceedings/forum2008/153-2008.pdf titled "SAS/OR®: Rigorous Constrained Optimized Binning for Credit Scoring"


There is also an R package called "smbinning" that you may find here: http://www.scoringmodeling.com/rpackage/smbinning/index.php?src=dsc20150221

 

* This is also mentioned on the Revolution Analytics blog:
http://blog.revolutionanalytics.com/2015/03/r-package-smbinning-optimal-binning-for-scoring-modeling...

 

So now I certainly believe it won't take much effort to include this in the next release...

 

Best

Alteryx

We thought about this long and hard two years ago, and consciously decided against it. Why? Because it is based only on binary comparisons (a single predictor relative to a target), which completely ignores any possible interaction effects between the predictors and the target, leading to potentially biased models. More modern methods (e.g., the random forest method behind the Forest Model tool and the gradient-based boosting method behind the Boosted Model tool) implicitly find the best way to "bin" a continuous predictor in a multivariate context. Put another way, while some other vendors may believe in encouraging poor practice on the part of users, we do not.

Alteryx Partner

There is some information lost with binning, I agree.

But then nonlinear relationships can be captured too, which we can't do in logistic regression with continuous variables.

Also, Dan, don't you think the interactions can be captured by considering combinations of 2 predictors relative to the target (binning using CHAID)?

 

Obviously, in terms of interpretability, it is definitely hard to go and explain a random forest model to a banking regulator...

Here is FICO model documentation; Experian scores the same way. They deliberately do tailor-made binning after the automatic binning to fix biases and match business constraints, so approx. 1.2 billion people around the world have been credit scored this way... And here is LGD model documentation from SAS on a similar basis.

 

Besides the analytical pros and cons of the process, most banks and insurance firms are looking for optimal binning or visual binning tools for their advanced analytics apps, which we used extensively in SAS and SPSS Statistics. By neglecting or rejecting the fact that this is common practice in several industries, don't you think gaining traction in those industries will be a little harder?

 

Best

DrDan,

 

That is a good point. In my former life as a professor, I was always suspicious of factor analysis for a similar reason. Coming from an econometrics background, I showed how factor analysis would lead to spurious results in the midst of complicated endogenous models.

 

However, in practice, I often find myself having to make compromises. For example, my company absolutely knows that some of our engagement activity is highly valued by some of our customers. The "correct" model would identify optimal engagement thresholds across different clusters of customers. A combination of clustering and hierarchical models could uncover this "correct" model. But we have two major roadblocks to unraveling such an elegant model: 1) lack of observations (we are a B2B company and do not have the thousands or tens of thousands of observations needed to use such algorithmic approaches with confidence), and 2) noisy engagement signals.

 

Putting that all together, even though I am a trained econometrician, I find myself in need of crude tools such as a simplistic optimal binning algorithm. My dissertation advisors would cringe at this post; I wish I had used a more cryptic username!

 

Thanks for reading these comments, always a huge fan of Alteryx.

Status changed to: Inactive
 

The status of this idea has been changed to 'Inactive'. This status indicates that:

 

1. The idea has not had activity in the form of likes or comments in over a year.

2. The idea has not reached ten likes.

3. The idea is still in the 'New Idea' status. 

 

However, this doesn't mean your idea won't be implemented! The Community can still like and comment on this idea. With enough renewed interest, this idea can be brought back into the 'New Idea' status. 

 

Thank you for contributing to the Alteryx Community and the Alteryx Product Idea Boards!