Weekly Challenges

Verakso · ‎05-03-2019

If I am not mistaken, the Expert Exam is also only two hours.

After spending more than double that time on this one question, I either need to really step up my Alteryx skills or seriously lower my expectations of passing the Expert exam.

Spoiler

So the first part, identifying the 10 numeric variables, was not that difficult - but are you even allowed to use Google at the exam?

After consulting Google, I used the Forest Model to find the 10 numeric variables with the highest mean decrease in Gini.

Mean Decrease in Gini
And then I got stuck: apparently there was no automatic way forward from here, so I manually entered the 10 identified variables into two Logistic Regression models.

And then I got stuck again 😕
I frankly admit that I had no idea how to find the Chi-Square, nor what it means for that matter, but I ended up in the same place as @danilang: the Nested Test Tool page.
That did not help me much. Normally I look for a better explanation of a tool in the Tool Mastery Index, but there was no Nested Test Tool entry there.
Then I normally go to the Live Training to see if there are any videos on the subject, but again I was out of luck.

Workflow
So I did what I assume many others do when they give up: look at what others have done. I took a deeper look at @danilang's solution, since it seemed so close to mine.

And by pure coincidence, we ended up with the same result. 😁

Chi-Sq result
Personally, I did not find the Help page for the Nested Test Tool that helpful, and that is where I normally turn to the Tool Mastery series for further explanation. I did manage to get the two models compared, but I was in doubt about how they should be connected, and how/why to use the full data set as well.
I would have hoped for a better explanation of what this tool actually does and how it works, because right now I am not any wiser.

I did take a peek into the macro, which provides some additional help (at least on what to connect where), but what it actually does is still a mystery to me, since it uses R and other sub-macros.

So in the end, all this seems pretty simple and straightforward, but only if you know this stuff deeply and are a statistician (which I am definitely not).
I have tried to widen my horizons here and, as I mentioned, spent a lot of hours on this, mostly reading up on all the links I could find, but it sure is heavy stuff.

Take the mean decrease in Gini, for instance. From what I have read, Gini is a way to determine how good the model is in the decision tree, and the lower the value is, the better (purer) your model is. If I have understood that correctly, then I think it is odd that we should look at the variables with the highest values 😕

Still Climbing
/Verakso

Reesetrain2 · ‎05-08-2019

This was an interesting question.

-The H values were all binary categorical variables, whereas the other variables were linear.

-I used the Logistic Regression tool and then Stepwise to eliminate variables. I then used a Spline Model to see which of the linear variables were of greatest importance. I also built a model without the categorical variables, examining the importance AND using Stepwise to see which were eliminated.

-Very interesting Challenge.

Matt

bingqian_gao · ‎05-09-2019

Hi Verakso,

I will attempt to share some explanation of this one, hoping that if any of it doesn't make sense or is incorrect, someone more knowledgeable will come to the rescue.

Spoiler

The way I approach understanding Gini importance is to go "backwards", starting from the surface level of interpreting the result:

First layer: "Mean Decrease in Gini" is a measure of variable importance for predicting a target variable. Since it's an "importance" measure, the higher it is, the more important the variable is.
Second layer, going one step back: Mean Decrease in Gini is the weighted average (mean) of a variable’s total decrease in node impurity, weighted by the proportion of samples reaching that node in each individual decision tree in the random forest. Since "impurity" is bad, the more it decreases (i.e. the higher the "mean decrease in Gini"), the better.
Third layer, going one step further back: Gini impurity measures how often a randomly chosen record from the data set used to train the model would be incorrectly labelled (hence it's "bad") if it were randomly labelled according to the distribution of labels in the subset. This is essentially the probability of a new record being incorrectly classified at a given node in a decision tree, based on the training data.
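
To make the three layers a bit more concrete, here is a minimal Python sketch of the third and second layers: the Gini impurity of a node, and the decrease in impurity produced by one split. This is only an illustration, not what the Forest Model runs internally; the function names and the toy "yes"/"no" labels are made up for the example.

```python
from collections import Counter


def gini_impurity(labels):
    """Chance that a random record is mislabelled when its label is drawn
    at random from the node's own label distribution: 1 - sum(p_k ** 2)."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())


def gini_decrease(parent, left, right):
    """Impurity of the parent node minus the size-weighted impurity
    of the two child nodes created by a split."""
    n = len(parent)
    weighted_children = (
        (len(left) / n) * gini_impurity(left)
        + (len(right) / n) * gini_impurity(right)
    )
    return gini_impurity(parent) - weighted_children


# Toy node (hypothetical data): 5 "yes" and 5 "no" records.
parent = ["yes"] * 5 + ["no"] * 5

# A split that separates the classes perfectly.
good_split = gini_decrease(parent, ["yes"] * 5, ["no"] * 5)

# A split that barely separates them (3/2 on one side, 2/3 on the other).
weak_split = gini_decrease(
    parent,
    ["yes"] * 3 + ["no"] * 2,
    ["yes"] * 2 + ["no"] * 3,
)

print(good_split, weak_split)
```

The clean split removes all of the impurity (a decrease of 0.5), while the weak split removes almost none (about 0.02). A random forest averages decreases like these over the nodes and trees where a variable is used, which is why a higher mean decrease marks a more important variable.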

Regarding the Nested Test, it essentially uses a likelihood-ratio test (LR test) to compare the goodness of fit of two models (https://en.wikipedia.org/wiki/Likelihood-ratio_test). The null hypothesis is that the full model is not better than the reduced model. When the chi sq value is large enough (compared to the threshold), which also means the p-value is small enough, we can reject the null hypothesis and conclude that the full model is better than the reduced one. In this case, since the chi sq is quite large and the p-value is very small (the typical threshold is 0.05), we can say that the effect of removing F_38 from the full model is SIGNIFICANT. The table below shows how to get to the p-value (range) from the chi sq (but Alteryx / R already gives you the p-value, so you don't need to worry about the conversion).

chi sq table_example.jpg
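
As a rough illustration of the conversion the table performs, the sketch below computes the LR chi-square statistic from two log-likelihoods and turns it into a p-value. It assumes the two models differ by exactly one parameter (1 degree of freedom), as when a single variable such as F_38 is dropped. The log-likelihood values are hypothetical, not the ones from this challenge, and this is not the code inside the Nested Test macro.

```python
import math


def likelihood_ratio_test(loglik_reduced, loglik_full):
    """LR test for two nested models that differ by ONE parameter (df = 1).

    Returns the chi-square statistic and its p-value.
    """
    # Test statistic: twice the gain in log-likelihood from the extra term.
    chi_sq = max(2.0 * (loglik_full - loglik_reduced), 0.0)
    # For 1 degree of freedom, the chi-square upper tail probability
    # equals erfc(sqrt(x / 2)).
    p_value = math.erfc(math.sqrt(chi_sq / 2.0))
    return chi_sq, p_value


# Hypothetical log-likelihoods for the reduced and the full model.
chi_sq, p_value = likelihood_ratio_test(loglik_reduced=-520.0, loglik_full=-505.0)
print(f"chi sq = {chi_sq:.1f}, p-value = {p_value:.2e}")

# Reject the null hypothesis (full model is no better) at the usual 0.05 level?
print("significant" if p_value < 0.05 else "not significant")
```

With more than one variable removed, the degrees of freedom change and the 1-df shortcut above no longer applies; a statistics library's chi-square distribution function would be needed instead.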

I hope this helps, and please let me know if anyone sees anything faulty in what I said...

Bingqian

Kenda · ‎05-17-2019

Spoiler

This would have been disastrous without the help of community resources! Luckily I am familiar with the mathematics behind the tools, but I had never used 2 of the 3 key tools before. Uf! It is pretty cool that Alteryx can test models like this, though!

jasperlch · ‎05-25-2019

Solution attached.

DanHare · ‎06-09-2019

Good fun reading around all this, though it wouldn't have been my immediate choice for a Friday!

Spoiler

jamielaird · ‎06-23-2019

Here's my solution

Spoiler

David-Carnes · ‎07-02-2019

Wish I had known then what I know now. I skipped over this in my first attempt at the Expert exam. Took me 9 minutes, start to finish, for the Challenge, and that included some reading of documentation.

Spoiler

LordNeilLord · ‎07-14-2019

14

Spoiler

RWvanLeeuwen · ‎08-13-2019

Now this I like

Spoiler

Short answer: 15

Long Answer:

Spoiler

Is Gain Ratio the same as Gini increase? I think it is impossible that any rankings change with this different interpretation. Please do correct me if I'm wrong on that.

Challenge #157: An Expert Challenge