Do you have the skills to make it to the top? Subscribe to our weekly challenges. Try your best to solve the problem, share your solution, and see how others tackled the same problem. We share our answer too.
Weekly Challenge

Challenge #157: An Expert Challenge

Alteryx Partner

If I am not mistaken, the Expert Exam is also only two hours.

 

After having spent more than double that time on this one question, I either need to really step up my Alteryx skills, or seriously downgrade my expectations of passing the Expert exam.

 

Spoiler
So the first part, identifying the 10 numeric variables, was not that difficult - but are you even allowed to use Google in the exam?

But after consulting Google, I used the Forest Model to find the 10 numeric variables with the highest mean decrease in Gini coefficient.
[Image: Mean Decrease in Gini]
And then I got stuck. Apparently there was no automatic way forward from here, so the 10 variables identified were manually entered into two Logistic Regression models.

And then I got stuck again 😕
I frankly admit that I had no idea how to find the Chi-Square, nor what it means for that matter, but I ended up in the same place as @danilang: at the Nested Test Tool page.
That did not help me much. Normally I look for a better explanation of the tool in the Tool Mastery Index, but there was no Nested Test Tool entry there.
Then I normally go to the Live Training to see if there are any videos on the subject, but again I was out of luck.
[Image: Workflow]
So I did what I assume many others do when they give up - look at what others have done. I took a deeper look at @danilang's solution, since it seemed so close to mine.

And by pure coincidence, we ended up with the same result. 😁
[Image: Chi-Sq result]
Personally I did not find the Help file on the Nested Test Tool page that helpful, and that is where I normally turn to the Tool Mastery for a further explanation. I did manage to get the two models compared, but was in doubt about how they should be connected, and how/why to use the full data set as well.
I would have hoped for some better explanation on what this tool actually does and how it works, because right now - I am not any wiser.

I did take a peek into the macro, which can provide some additional help (at least on what to connect where), but what it actually does is still a mystery to me, since it uses R and other sub-macros.
 

So in the end, all this seems pretty simple and straightforward, but only if you know this stuff deeply and are a statistician (which I am definitely not).
I have tried to widen my horizon here, and as I mentioned, spent a lot of hours on this, mostly reading up on all the links I could find, but it sure is heavy stuff.

Take the decrease in Gini coefficient, for instance. From what I have read, this is a way of determining how good the model is in the decision tree, and the lower the value is, the better (purer) your model is. If I have understood that correctly, then I think it is odd that we should look at the ones with the highest value 😕

 

Still Climbing
/Verakso

 

 

 

 

 

Asteroid

Chi SQ Image.png

This was an interesting question.

-The H values were all binary categorical variables, whereas the other variables were linear.

-I used the Logistic Regression and then Stepwise to eliminate variables. I then used a Spline Model to see which of the linear variables were of greatest importance, and also built a model without the categorical variables, examining the importance AND using Stepwise to see which were eliminated.

-Very interesting Challenge.

 

Matt

Alteryx Certified Partner

Hi Verakso,

 

I will attempt to share some explanation of this one, hoping that if any of it doesn't make sense or is incorrect, someone more knowledgeable will come to the rescue.

 

Spoiler

 

 

1.jpg

 

 

The way I approach understanding the Gini Importance is to go "backwards", starting from the surface level of interpreting the result:

  • First layer, "Mean Decrease in Gini" is a measure of variable importance for predicting a target variable. Since it's an "importance" measure, the higher it is, the more important it is.
  • Second layer, go one step back, Mean Decrease in Gini is the weighted average (mean) of a variable’s total decrease in node impurity, weighted by the proportion of samples reaching that node in each individual decision tree in the random forest. Since it's "impurity", it is bad, so the more it decreases (i.e. the higher the "mean decrease in Gini"), the better.
  • Third layer, go one step further back, Gini Impurity measures how often a randomly chosen record from the data set used to train the model will be incorrectly labelled (hence it's "bad") if it was randomly labelled according to the distribution of labels in the subset. This is essentially measuring the probability of a new record being incorrectly classified at a given node in a Decision Tree, based on the training data. 
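
To make the three layers concrete, here is a minimal Python sketch of the arithmetic for a single split. It is only an illustration under my own assumptions: the function names and the toy "yes"/"no" labels are made up, and this is not the code the Forest Model tool or its underlying R package actually runs.

```python
from collections import Counter


def gini_impurity(labels):
    """Chance that a random record in this node is mislabelled when labels
    are drawn from the node's own label distribution (the third layer)."""
    n = len(labels)
    if n == 0:
        return 0.0
    counts = Counter(labels)
    return 1.0 - sum((c / n) ** 2 for c in counts.values())


def gini_decrease(parent, left, right, n_total):
    """Impurity decrease for one split, weighted by the share of all
    training records that reach the parent node (the second layer)."""
    n = len(parent)
    children = (len(left) / n) * gini_impurity(left) \
        + (len(right) / n) * gini_impurity(right)
    return (n / n_total) * (gini_impurity(parent) - children)


# Hypothetical toy node: 5 "yes" and 5 "no" records, impurity 0.5.
parent = ["yes"] * 5 + ["no"] * 5

# A variable that separates the classes perfectly...
good = gini_decrease(parent, ["yes"] * 5, ["no"] * 5, n_total=10)

# ...versus a variable that barely separates them.
weak = gini_decrease(parent,
                     ["yes"] * 3 + ["no"] * 2,
                     ["yes"] * 2 + ["no"] * 3,
                     n_total=10)

print(good, weak)
```

The perfectly separating variable removes all of the impurity at the node, while the weak one removes very little. Summing these decreases per variable within a tree and averaging over the trees in the forest gives the first layer, which is why a higher "Mean Decrease in Gini" means a more important variable even though lower impurity is better.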

 

 

2.jpg

 

Regarding the Nested Test, it essentially uses a likelihood-ratio test (LR test) to compare the goodness of fit of two models (https://en.wikipedia.org/wiki/Likelihood-ratio_test). The null hypothesis is that the full model is not better than the reduced model. When the chi sq value is large enough (compared to the threshold), which also means the p-value is small enough, we can reject the null hypothesis, meaning we can conclude that one model (the full model) is better than the other (the reduced). In this case, since the chi sq is quite large and the p-value is very small (a typical threshold is 0.05), we can say that the effect of removing F_38 from the full model is SIGNIFICANT. The table below shows how to get to the p-value (range) from the chi sq (but Alteryx / R already gives you the p-value, so you don't need to worry about the conversion).
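
For anyone who wants to see the conversion from chi sq to p-value spelled out, here is a small Python sketch of a likelihood-ratio test. It rests on assumptions of mine: the two models differ by exactly one numeric variable (so 1 degree of freedom), the function name is made up, and the log-likelihood values are hypothetical numbers, not the ones from this challenge. It is not what the Nested Test macro does internally.

```python
import math


def likelihood_ratio_test(loglik_reduced, loglik_full):
    """LR test for nested models that differ by ONE parameter (1 df).

    Returns (chi_sq, p_value). For 1 degree of freedom the chi-square
    upper tail has the closed form erfc(sqrt(x / 2)).
    """
    chi_sq = max(2.0 * (loglik_full - loglik_reduced), 0.0)
    p_value = math.erfc(math.sqrt(chi_sq / 2.0))
    return chi_sq, p_value


# Hypothetical log-likelihoods of the reduced and the full model.
chi_sq, p_value = likelihood_ratio_test(loglik_reduced=-520.0,
                                        loglik_full=-510.0)
print(chi_sq, p_value)

# Reject the null hypothesis (full model is not better) at the 0.05 level?
print(p_value < 0.05)
```

With 1 degree of freedom the 0.05 threshold corresponds to a chi sq of about 3.84, so any larger value leads to rejecting the null hypothesis. If the models differ by more than one parameter, the degrees of freedom change and this shortcut formula no longer applies.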

 

chi sq table_example.jpg

 

 

 

I hope this helps, and please let me know if anyone sees anything that I said was faulty... 

 

Bingqian

Pulsar
Spoiler
This would have been disastrous without the help of community resources! Luckily I am familiar with the mathematics behind the tools, but I have never used 2 of the 3 key tools before. Uf! It is pretty cool that Alteryx can test models like this, though!

Capture.PNG
Quasar

Solution attached.

Alteryx Certified Partner

Good fun reading around all this, though this wouldn't have been my immediate choice for a Friday!

 

Spoiler
157_Workflow.PNG157_ChiSquaredDifference.PNG
Alteryx Certified Partner

Here's my solution

 

Spoiler
Screenshot 2019-06-23 at 10.07.37.pngScreenshot 2019-06-23 at 10.07.54.png

Wish I had known then what I know now.  I skipped over this in my first attempt at the Expert exam.  Took me 9 minutes, start to finish, for the Challenge, and that included some reading of documentation.

 

Spoiler
157.png
Alteryx Certified Partner

14

Spoiler
Capture.PNG