Hi everyone,
I designed a macro that computes the Chow-Liu tree of categorical datasets. You can use it on any other dataset as long as you first turn non-categorical fields into categories, notably trough binning (or tiling as it is called in Alteryx) of numerical fields.
The point of a Chow-Liu tree is to show the main dependencies between the variables (categorical fields) at hand. You can replace the word “dependency” with “correlation” or “mutual information” or “transinformation” or “cross-entropy”. The dependencies will not de oriented thus do not show the direction of the potential causal effects at work between the variables (but Judea Pearl has proven that one cannot infer full causal structures based only observational data anyhow). The method eliminates the weakest statistical dependencies in order to achieve a maximum-entropy spanning tree, i.e. a tree without loops or in other words in which the path between any two variables is unique. The model obtained is in general an approximation yet a very simple to interpret and in most cases quite relevant one.
You may use this tree…
- to get a quick overview of your dataset and its internal dependencies;
- before considering predictive analytics so you focus on most relevant influencers on your target variable;
- as starting point to build a Bayesian network;
- to build a causal model (e.g. a causal Bayesian network) by turning the non-oriented edges into causal arrows.
Have fun with this macro. In case it turns out to be useful for you (!) do not hesitate to give feedback about your implementation and the kind of datasets you applied the macro to.
Gottfried
Technical details about the workflow
https://en.wikipedia.org/wiki/Chow-Liu_tree
The main steps in the Alteryx macro are:
- Data acquisition and preparation (notably removing unique value fields)
- Calculate transinformation between all pairs of variables (categorical fields).
- Run a minimum-weight spanning tree (MST) algorithm using transinformation (actually negative transinformation) as weight.
I used a MST algorithm written in R found on CRAN (Comprehensive R Archive Network) which I adapted in the R tool of Alteryx.
https://rdrr.io/cran/edmcr/src/R/mst.R
You may also use the very nice “pure Alteryx” macro designed by @StephaneP (Stéphane Portier) based on the famous Prim’s algorithm.
Prism s algorythmv3.yxzp
https://en.wikipedia.org/wiki/Prim%27s_algorithm