Data Science

Machine learning & data science for beginners and experts alike.
mstarks
Alteryx Alumni (Retired)

As a fairly new developer on the Core Engines team, I am thrilled to have the opportunity to add functionality to the R Tool. While working on new features, I have taken a bit of time to understand more about R and how R can be used to solve real-world problems.

 

According to its home page, "www.r-project.org", R is a software environment for statistical computing. It allows users to define statistical models, perform analysis tasks, and plot results. Because R has been embraced by a large community of developers, its capabilities are continually expanding. In fact, over 3700 packages have been developed! These can be downloaded from the Comprehensive R Archive Network at "cran.r-project.org". Fortunately, only a small subset of the packages is required for most business applications.

 

What statistical analysis approaches are important in the context of business intelligence? Predictive modeling can be used for prospecting, qualifying prospects, cross-selling, up-selling, analyzing attrition and churn, and detecting fraud. Grouping can be used for market basket analysis, recommendation systems, fraud detection, and customer segmentation. Data mining allows large amounts of data to be summarized in a way that supports decision-making.

 

Specific predictive modeling techniques include linear regression, logistic regression, decision trees, and random forests. Predictive models allow you to estimate the probability of a given behavior based on previously acquired data. Grouping methods include K-Centroids clustering and hierarchical cluster analysis, along with association rules. Interesting patterns can emerge when you find useful ways to group data.

 

The initial work of bringing the capabilities of R into Alteryx was completed prior to the 7.0 Release. Here is a quick look at what can be done at this point.

 

The R Tool can be included directly in any module. It can accept multiple optional inputs, and the incoming connections can be read within R. Users can write their own R scripts to perform statistical analysis. (The "R Tool Predictive Analytics" sample module provides an example of how this works.) The R Tool has two optional output connections. The left output is for writing data values, and the right output is for generating graphs. Most problems that can be solved using R can now be addressed within Alteryx.

 

What can you expect to see in the future? Dr. Dan Putler has created several macros that can be used to accomplish the most common predictive tasks. These are going to be made available under a new "Predictive Tools" category within Alteryx. If you want to invest in the power of R, these macros are going to provide useful analysis techniques and examples to help you get started. We are beginning to look at how the data artisan can easily incorporate the results of predictive analytics activities into the Alteryx workflow. Predictive Model Markup Language (PMML) is the industry-standard approach for defining and sharing data mining models. You can expect to see some PMML support within Alteryx this year. Also, the new Alteryx R Data Exchange package is going to allow R scripts to read and write data in the Alteryx YXDB format. Additional features are in the planning stages.

 

I am fascinated by the strong interest in R expressed by the Alteryx community. We are committed to making predictive analytics a seamless part of the Alteryx user experience. Please contact me or Dr. Dan Putler if you have specific questions or recommendations for improving the R Tool. Thanks!

Comments
Atabarezz
13 - Pulsar

Does Alteryx have PMML support?

 

  • Couldn't find any details over the net so far...
  • No mention of PMML on the help document either...

 

Best

 

Altan

DrDan
Alteryx Alumni (Retired)

Hi Altan,

We will be launching a new macro in the Public Gallery called the Output Model tool that will enable a predictive model created in Alteryx to be written to disk in PMML format. Not all model types will be supported, but those handled by the R pmml package will be. This means that models created with the Linear, Logistic, Count, and Gamma Regression tools are supported, as are models created using the Decision Tree, Forest Model, Naive Bayes, Neural Network, and Support Vector Machine tools. The tool will also output any model type to R's own native file format. We will make an announcement on the EngineWorks blog when it is released.

Dan

Atabarezz
13 - Pulsar

That is awesome news, thanks for the information Dan...

I'm looking forward to it...

 

Best

 

Altan

msumar
5 - Atom

Hi,

 

I need a basic crosstab report with statistical significance highlighted at, say, either the 95% or the 90% confidence level.

 

It looks like using Alteryx with R may provide a solution. Do you have any suggestions for making this work?

 

Thanks,

Umar

DrDan
Alteryx Alumni (Retired)

@msumar: The Contingency Table tool in the Data Investigation category provides the chi-square statistic of a test of independence between the two categorical fields being compared in the (two-way) table. The chi-square test of independence appears to be what you are referring to, although you really want to look at the significance of these tests at the 5% level (which is what you mean by 95%) and the 10% level (which is what you mean by 90%). What this tool returns is the p-value of the test. The test is significant at the 5% level if the p-value is below 0.05, and it is significant at the 10% level if the p-value is below 0.10.
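
To make the decision rule concrete, here is a minimal sketch in Python using only the standard library. It is not the code the Contingency Table tool runs, the function name `chi_square_2x2` is hypothetical, and the counts are made up for illustration. It computes the Pearson chi-square statistic for a 2x2 table without a continuity correction. Because such a table has one degree of freedom, the p-value can be taken from the two-sided normal tail via `math.erfc`.

```python
import math


def chi_square_2x2(table):
    """Pearson chi-square test of independence for a 2x2 table of counts.

    Hypothetical helper for illustration; no continuity correction is applied.
    Returns (chi-square statistic, p-value).
    """
    (a, b), (c, d) = table
    n = a + b + c + d
    row_totals = [a + b, c + d]
    col_totals = [a + c, b + d]

    chi2 = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # Expected count under independence of rows and columns.
            expected = row_totals[i] * col_totals[j] / n
            chi2 += (observed - expected) ** 2 / expected

    # With 1 degree of freedom, P(X > chi2) equals P(|Z| > sqrt(chi2)).
    p_value = math.erfc(math.sqrt(chi2 / 2.0))
    return chi2, p_value


# Made-up counts: rows are two customer groups, columns are yes/no answers.
counts = [[30, 20], [20, 30]]
chi2, p = chi_square_2x2(counts)

significant_at_5 = p < 0.05
significant_at_10 = p < 0.10
print(f"chi-square = {chi2:.2f}, p-value = {p:.4f}")
print(f"significant at 5%: {significant_at_5}, at 10%: {significant_at_10}")
```

Tables with more than two rows or columns have more degrees of freedom, so the `erfc` shortcut no longer applies and a general chi-square distribution function is needed instead.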

 

Dan

msumar
5 - Atom

Hi Dan,

 

Thanks for your message.

 

Is there a possibility of generating crosstabs in Alteryx similar to the one below (see page 43), with the comparison groups clearly marked?

 

http://www.analyticalgroup.com/download/active/WinCrossExploring.pdf

 

Thanks!

 

Umar

DrDan
Alteryx Alumni (Retired)

@msumar: The answer is really "sort of". Using t-tests of means and z-tests of proportions across cells of a contingency table is not common practice; in fact, this is the first example I've seen of it. As I indicated in my last post, common practice is to do a chi-square test of independence, which makes no underlying parametric assumptions (the two tests you indicate actually make very strong parametric assumptions about the underlying statistical distributions that generated the data). In Alteryx you would need to conduct a number of separate tests, using the Test of Means tool for the t-tests and, at the moment, custom R code for the z-tests of proportions. So it is possible, but cumbersome. However, I would tend to advise against it, partly because of the strong parametric assumptions that need to be made and, probably more importantly, because in a lot of contingency table analyses (particularly on standard marketing sample survey data that have fairly small sample sizes) there will be a high probability of making many type II errors (failing to detect effects that are really there) due to the small sample sizes in each cell, which leave these tests with little power. As a reference see: https://en.wikipedia.org/wiki/Type_I_and_type_II_errors
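
For readers who want to see what a z-test of two proportions involves, here is a minimal sketch in Python using only the standard library. It is not R, and it is not code from any Alteryx tool. The function name `two_proportion_z_test` and the counts are hypothetical, and the test uses the pooled standard error with a two-sided p-value from the normal distribution.

```python
import math


def two_proportion_z_test(successes_a, n_a, successes_b, n_b):
    """Two-sided z-test for the difference between two proportions.

    Hypothetical helper for illustration; uses the pooled standard error.
    Returns (z statistic, p-value).
    """
    p_a = successes_a / n_a
    p_b = successes_b / n_b

    # Pooled proportion under the null hypothesis that both groups are equal.
    pooled = (successes_a + successes_b) / (n_a + n_b)
    std_error = math.sqrt(pooled * (1.0 - pooled) * (1.0 / n_a + 1.0 / n_b))

    z = (p_a - p_b) / std_error
    # Two-sided tail probability of the standard normal distribution.
    p_value = math.erfc(abs(z) / math.sqrt(2.0))
    return z, p_value


# Made-up counts: 30 of 50 respondents in group A versus 20 of 50 in group B.
z, p = two_proportion_z_test(30, 50, 20, 50)
print(f"z = {z:.2f}, p-value = {p:.4f}")
```

For a single pair of groups, the square of this z statistic equals the uncorrected chi-square statistic of the corresponding 2x2 table, so the two approaches agree there. The concerns above arise when many such pairwise tests are run across the cells of a larger table, each on a small number of respondents.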

 

Dan

msumar
5 - Atom

Hi Dr Dan,

 

Sincerely appreciate your quick response.

 

In my opinion, Alteryx has beautifully incorporated some major features of commonly available statistical packages. However, to the best of my knowledge, the requested feature is quite commonly used by most research agencies to generate these crosstab tables, either with their own proprietary software or with outsourced software (<5).

 

Thanks a lot!

 

Umar