I think the Nearest Neighbor Algorithm is one of the least used, and most powerful algorithms I know of. It allows me to connect data points with other data points that are similar. When something is unpredictable, or I simply don't have enough data, this allows me to compare one data point with its nearest neighbors.
So, last night I was at school, taking a graduate level Econ course. We were discussing various distance algorithms for a nearest neighbor algorithm. Our prof discussed one called the Mahalanobis distance. It uses some fancy matrix algebra. Essentially it filters out the noise and matches only on the variables that are truly significant. It takes into account the correlation that may exist among variables, so that a group of correlated variables effectively counts only once.
I use Nearest Neighbor when other things aren't working for me: when my data sets are weak, sparse, or otherwise not predictable. Sometimes I don't know that particular variables are correlated. This is a powerful distance measure that could be added to Nearest Neighbor to allow for matches that might not otherwise be found, and to allow matches on only the variables that really matter.
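To make the idea concrete, here is a minimal sketch of a nearest neighbor lookup under the Mahalanobis distance. It assumes NumPy is available; the function name `mahalanobis_nearest` and the sample points are made up for illustration and are not part of any Alteryx tool.

```python
import numpy as np

def mahalanobis_nearest(data, query, k=1):
    """Return (indices, distances) of the k rows of `data` closest to
    `query` under the Mahalanobis distance."""
    data = np.asarray(data, dtype=float)
    query = np.asarray(query, dtype=float)
    # Covariance between the columns; pinv tolerates a near-singular matrix.
    cov = np.atleast_2d(np.cov(data, rowvar=False))
    inv_cov = np.linalg.pinv(cov)
    diff = data - query
    # Squared distance (x - q)^T S^-1 (x - q), computed for every row at once.
    d2 = np.einsum("ij,jk,ik->i", diff, inv_cov, diff)
    order = np.argsort(d2)[:k]
    return order, np.sqrt(np.maximum(d2[order], 0.0))

# Hypothetical data: two strongly correlated columns.
points = np.array([[1.0, 1.1], [2.0, 2.1], [3.0, 2.9], [4.0, 4.2], [2.0, 3.5]])
idx, dist = mahalanobis_nearest(points, query=[2.5, 2.5], k=2)
print(idx, dist)
```

Because the inverse covariance matrix down-weights directions in which the columns move together, two correlated variables do not get counted twice the way they would with plain Euclidean distance.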
This idea arose recently when working specifically with the Association Analysis tool, but I have a feeling that other predictive tools could benefit as well. I was trying to run an association analysis for a large number of variables, but when I was investigating the output using the new interactive tools, I was presented with something similar to this:
While the correlation plot draws your eye to high associations, the user is unable to read the field names, and the tooltip only provides the correlation value rather than the fields along with the value. As such, I shifted my attention to the report output, which looked like this:
While I could now read everything, it made pulling out the insights much more difficult. Wanting the best of both worlds, I decided to extract the correlation table from the R output and drop it into Tableau for a filterable, interactive version of the correlation matrix. This turned out to be much easier said than done. Because the R output comes in report form, I tried to use the report extract macros mentioned in this thread to pull out the actual values. This was an issue due to the report formatting, so instead I cracked open the macro to extract the data directly from the R output. To make a long story shorter, this ended up being problematic due to report formats, batch macro pathing, and an unidentifiable bug.
In the end, it would be great if there were a “Data” output for reports from certain predictive tools that would benefit from further analysis. While the reports and interactive outputs are great for ingesting small model outputs, at times there is a need to extract the data itself for further analysis/visualization. This is one example, as are the model coefficients from regression analyses that I have used in the past. I know Dr. Dan created a model coefficients macro for the case of regression, but I have to imagine that there are other cases where the data is desired along with the report/interactive output.
A lot of popular machine learning systems use a computer's GPU to speed up some of the math to a huge degree. The header on this article on Medium shows a 15x difference from a high-end CPU vs a high-end GPU. It could also create an improvement in the spatial tools. Perhaps Alteryx should add this functionality in order to speed up these tools, which I can imagine are currently some of the slowest.
It would be nice if this option would take you to the correct download page for the version the user has installed. Currently, this always loads the download page for the current version, which is confusing for users at companies that are still required to use an older version.
Up to version 10.0 I could open pretty much all analytics tools as a macro, to tweak things in R or in the macro workflow to get the results in a way most useful to us.
But apparently with Alteryx 11.0 the newer tools do not have that option. We can still access the older versions of those tools and open them as macros, but I don't understand why this is being removed in the newer versions (maybe because they have the interactive report option?).
Most of the newer versions have new features (Linear Regression now supports elastic net and cross-validation, for example), but I still want to be able to go into them to tweak them.
Python pandas dataframes and data types (NumPy arrays, lists, dictionaries, etc.) are much more robust in general than their counterparts in R, and they play together much more easily as well. Moreover, only a handful of packages, such as scikit-learn, pandas, NumPy, and seaborn, cover everything a data scientist would need, including graphing. After utilizing R, Python, and Alteryx, I'm still a big proponent of integrating with the Python language much like Alteryx has integrated with R. At the very least, I propose adding the ability to write custom code, such as a Python tool.
Would be extremely useful if the Summarize Tool had an option in the numeric menu to Standardize the data. More often than not, data sets will not have the same count of variables which makes the comparison analysis meaningless. Currently, there is no easy way to Standardize the data without using the K-Centroids Cluster Analysis tool or standardize_unit interval supporting macro.
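For reference, the kind of standardization I have in mind is a plain z-score. Here is a minimal sketch using only the Python standard library; the function name `standardize` and the sample column are hypothetical, and a real option in the tool would also need to decide how to treat nulls.

```python
from statistics import mean, stdev

def standardize(values):
    """Return z-scores: (value - mean) / sample standard deviation."""
    m = mean(values)
    s = stdev(values)
    if s == 0:
        raise ValueError("cannot standardize a constant column")
    return [(v - m) / s for v in values]

sales = [120.0, 150.0, 90.0, 200.0]  # hypothetical numeric field
z_sales = standardize(sales)
print(z_sales)
```

After this step every field has mean 0 and standard deviation 1, so fields measured on different scales can be compared directly.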
But it's still too hard to use. It requires you to have prior knowledge of a bunch of parameters and different types of knowledge.
Can we improve the interface on this tool so that it can be used by folk who do not have a background in R - for example, take all the different inputs, and make them parameterized on drop-down boxes or input boxes on the tool?
It is important to be able to test for heteroscedasticity, so a tool for this test would be much appreciated.
In addition, I strongly believe the ability to calculate robust standard errors should be included as an option in existing regression tools, where applicable. This is a standard feature in most statistical analysis software packages.
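To illustrate both requests, here is a rough sketch, assuming NumPy, of one common heteroscedasticity test (the studentized Breusch-Pagan statistic, n times the R-squared from regressing squared residuals on the predictors) and of White's HC0 robust standard errors. The function names and the simulated data are my own; this is not how any existing Alteryx tool works.

```python
import numpy as np

def ols(X, y):
    """Least-squares coefficients and residuals; X must include an intercept column."""
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    return beta, y - X @ beta

def breusch_pagan_lm(X, y):
    """Studentized Breusch-Pagan statistic: n * R^2 of squared residuals on X."""
    _, resid = ols(X, y)
    u2 = resid ** 2
    _, aux_resid = ols(X, u2)
    centered = u2 - u2.mean()
    r2 = 1.0 - (aux_resid @ aux_resid) / (centered @ centered)
    return len(y) * r2

def hc0_standard_errors(X, y):
    """White (HC0) heteroscedasticity-robust standard errors."""
    beta, resid = ols(X, y)
    xtx_inv = np.linalg.inv(X.T @ X)
    meat = X.T @ (X * (resid ** 2)[:, None])
    cov = xtx_inv @ meat @ xtx_inv  # sandwich estimator
    return beta, np.sqrt(np.diag(cov))

# Simulated data whose noise grows with x (heteroscedastic by construction).
rng = np.random.default_rng(0)
x = np.linspace(1.0, 10.0, 200)
y = 2.0 + 3.0 * x + rng.normal(0.0, 0.5 * x)
X = np.column_stack([np.ones_like(x), x])

lm_stat = breusch_pagan_lm(X, y)
beta, se = hc0_standard_errors(X, y)
# Compare lm_stat with a chi-square critical value (3.841 at 5% for one predictor).
print(lm_stat, beta, se)
```

A large statistic relative to the chi-square critical value suggests the error variance depends on the predictors, which is exactly the case where the robust standard errors matter.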
TLDR: Add a parameter repeat (rep) to the Neural Network configuration panel, and probably other Predictive tools. This would make the tool return only the best trained model out of #rep.
I have done some research by comparing the R neuralnet function (from Neuralnet package) and the Alteryx Neural Network tool. I thought that the XOR example was a good one to evaluate a neural network system since it is one of the most basic non-linear tables.
However several attempts must be made. This is why the parameter
is important, followed with
This way, instead of one, five sets of initial parameters are randomly generated and five neural networks are trained independently. When plotting, only the neural network with the lowest error is plotted (I got 3% error). For XOR, around 5 tries seem to be enough to find a set of parameters that guesses the 4 situations of the table correctly.
Another parameter of the function is stepmax:
"The maximum steps for the training of the neural network. Reaching this maximum leads to a stop of the neural network’s training process."
This governs the maximum number of iterations for each single attempt described above.
If one writes stepmax = 100 instead of rep = 5, the single resulting neural network usually does not have an average error below 5% (I actually get 49.7%).
Alteryx Neural Network
The Neural Network tool in Alteryx has a parameter documented as follows:
"The maximum number of iterations for model estimation: This value controls the number of attempts the algorithm can make in attempting to find improvements in the set of model weights relative to the previous set of weights. If no improvements are found in the weights prior to the maximum number of iterations, the algorithm will terminate and return the best set of weights. This option defaults to 100 iterations. In general, given the behavior of the algorithm, it is likely to make sense to increase this value if needed, at the cost of lengthening the runtime for model creation."
I tried this workflow with this parameter set to default value 100, and even greater values, and I cannot get an average error below 5% (I get stuck around 50%).
After carefully reading these pieces of documentation and testing, I can guess that this last parameter is the equivalent of stepmax.
If this is right, then it would be practical to add an equivalent of the parameter rep to the configuration panel. A repeat parameter seems more useful to me than the iteration limit. Maybe there is already a way to simulate this parameter; if so, please let me know, otherwise I am going to include an R script in my workflow.
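Here is a rough sketch of what I mean by a repeat parameter, written in Python with NumPy rather than with the neuralnet package. It trains a tiny XOR network several times from different random starting weights and keeps the run with the lowest error. The training loop is plain gradient descent, which is my own simplification and not the algorithm neuralnet or the Alteryx tool uses; the names `train_once` and `train_best_of` are hypothetical.

```python
import numpy as np

X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y = np.array([[0], [1], [1], [0]], dtype=float)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def train_once(rng, hidden=2, steps=2000, lr=0.5):
    """Train one 2-hidden-1 network from random weights; return (error, weights)."""
    w1 = rng.normal(size=(2, hidden))
    b1 = np.zeros(hidden)
    w2 = rng.normal(size=(hidden, 1))
    b2 = np.zeros(1)
    for _ in range(steps):  # `steps` plays the role of the iteration limit
        h = sigmoid(X @ w1 + b1)
        out = sigmoid(h @ w2 + b2)
        # Gradients of the squared error, computed before any weight update.
        d_out = (out - y) * out * (1 - out)
        d_h = (d_out @ w2.T) * h * (1 - h)
        w2 -= lr * h.T @ d_out
        b2 -= lr * d_out.sum(axis=0)
        w1 -= lr * X.T @ d_h
        b1 -= lr * d_h.sum(axis=0)
    out = sigmoid(sigmoid(X @ w1 + b1) @ w2 + b2)
    error = float(np.mean(np.abs(out - y)))
    return error, (w1, b1, w2, b2)

def train_best_of(rep, seed=0):
    """Run `rep` independent trainings and return all errors plus the best run."""
    rng = np.random.default_rng(seed)
    runs = [train_once(rng) for _ in range(rep)]
    errors = [e for e, _ in runs]
    return errors, runs[int(np.argmin(errors))]

errors, (best_error, best_weights) = train_best_of(rep=5)
print(errors, best_error)
```

The point is that raising the iteration limit only helps a run that started from good weights, while repeating the training gives several chances to start somewhere useful.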
I suggest adding a piecewise linear regression tool to the predictive tool set. Many times, modelers in the insurance business (and others) like to play around with the model, i.e. placing knots/splines at multiple intervals and then, after seeing the results, changing where those knots are placed. This gives the best fit based not only on mathematical technical knowledge but also on domain/tribal knowledge, especially when training data sets are small.
Currently, we have to try to develop our own model matched with a macro which changes the R code. To date, there has been no luck, but maybe Alteryx's excellent customer support could get the developers to do this for us and then... we could finally bury SAS in the ground.
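As a sketch of the kind of model I mean, a piecewise linear fit with user-chosen knots can be written as an ordinary least-squares regression on "hinge" terms max(0, x - knot). The example below assumes NumPy; the function names and the toy data are made up for illustration.

```python
import numpy as np

def hinge_design(x, knots):
    """Design matrix: intercept, x, and max(0, x - knot) for each knot."""
    x = np.asarray(x, dtype=float)
    cols = [np.ones_like(x), x] + [np.maximum(0.0, x - k) for k in knots]
    return np.column_stack(cols)

def fit_piecewise(x, y, knots):
    """Least-squares coefficients for a continuous piecewise linear model."""
    design = hinge_design(x, knots)
    coef, *_ = np.linalg.lstsq(design, np.asarray(y, dtype=float), rcond=None)
    return coef

def sum_squared_error(x, y, knots):
    """Residual sum of squares for a given knot placement."""
    resid = np.asarray(y, dtype=float) - hinge_design(x, knots) @ fit_piecewise(x, y, knots)
    return float(resid @ resid)

# Toy data: slope 1 up to x = 5, slope 3 afterwards.
x = np.arange(0.0, 11.0)
y = np.where(x <= 5, x, 5 + 3 * (x - 5))

coef = fit_piecewise(x, y, knots=[5.0])
print(coef)  # intercept, base slope, change in slope at the knot
print(sum_squared_error(x, y, [5.0]), sum_squared_error(x, y, [3.0]))
```

Moving a knot is just a matter of passing a different `knots` list and refitting, which is the sort of trial and error a modeler would want to do from a tool's configuration panel.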
I had an issue with the ARIMA tool using the R function auto.arima() which missed the obvious seasonality in my data set, but when I manually adjusted Model Customization > Customize the parameters used for automatic model creation > The seasonal components > Alter the degree of seasonal differencing and selected 1, the outputs were much better. The problem is that while this fixes my singular ARIMA model, I want to be able to run this through Model Factory too as I have hundreds of data sets on which I'd have to make this adjustment since auto.arima() misses the seasonality on most of them and produces useless results. Apparently I'm not alone in having this issue as I was directed to another comment thread where someone had the same problem and desired fix as me.
TL/DR version -- please make it possible to manually adjust the parameters of the ARIMA Model Factory tool, or at the very least add the option to alter the degree of seasonal differencing.
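For anyone unfamiliar with the setting, here is a minimal pure-Python sketch of what a seasonal differencing degree of 1 does to a series. The function name and the quarterly data are hypothetical; as far as I know this corresponds to the D argument of auto.arima(), but treat that mapping as my assumption.

```python
def seasonal_difference(series, period, degree=1):
    """Apply y[t] - y[t - period] to the series `degree` times."""
    out = list(series)
    for _ in range(degree):
        out = [out[t] - out[t - period] for t in range(period, len(out))]
    return out

# Hypothetical quarterly series: a linear trend plus a repeating seasonal pattern.
pattern = [10, -5, 0, 20]
series = [2 * t + pattern[t % 4] for t in range(16)]

diffed = seasonal_difference(series, period=4, degree=1)
print(diffed)
```

One round of seasonal differencing removes the repeating quarterly pattern entirely here, leaving only the year-over-year change for the rest of the model to explain.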
And for supporting information, here is my thread:
We recently upgraded our SQL server to 2016 to enable us to use R Server for predictive analytics. We were excited about the more powerful algorithms and the fact that parallel processing will make things faster on bigger data sets.
We often use stepwise logistic regression, especially in cases where we need to show which attributes are most significant. The one drawback about the upgrade was that stepwise is not available when running logistic regression in-database. I know there are ways to get around this e.g. PC etc. but it would be nice to have the ability to do stepwise in-database.
I hope there are others like me that will vote this up. I think it will help a lot of data scientists out there and is probably one of the easier suggestions :-).
We know Alteryx uses R for a lot of its predictive and data analysis tools. It takes a while to run a workflow whenever an R-based tool is involved. I was told by a solution engineer that it's because it's opening and closing R in the background.
Sometimes my workflow has a bunch of tools which are running R in the background and it takes forever to run the workflow.
I think there should be a user setting which allows users to choose whether they want to start R along with Alteryx and keep it running in the background.
Improve Help Documentation or in-tool options for handling null values in statistical tools like Weighted Average or Linear Regression. For instance, add a checkbox to remove null value records, or at least warn users.
In the process of learning to perform linear regression in RStudio and Alteryx, I came across differing outputs depending on how null values were addressed. Take the Weighted Average tool for example.
In R, the weighted.mean function drops null (NA) values in the variable of interest only when the user asks for it with na.rm = TRUE. If null values exist and the user does not specify this, the result is NA. If any null values exist in the weight field, the result is NA.
Since I am more familiar with Alteryx, I originally did the data preparation—including calculating the weighted means—in Alteryx. When comparing these weighted means with those generated in R, I found that Alteryx treats the null values as zeros (i.e. includes them in the calculation). The user would have to know this is incorrect and first filter out the null values. See screenshot examples.
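The difference between the two behaviors is easy to reproduce. Below is a small pure-Python sketch (not the actual Alteryx or R implementation) with hypothetical function names, where `None` stands in for a null value.

```python
def weighted_mean(values, weights, skip_null=True):
    """Weighted mean that, like R with na.rm = TRUE, drops rows whose value is null.

    Returns None if a null value remains or if any remaining weight is null.
    """
    pairs = list(zip(values, weights))
    if skip_null:
        pairs = [(v, w) for v, w in pairs if v is not None]
    if any(v is None or w is None for v, w in pairs):
        return None
    total_weight = sum(w for _, w in pairs)
    if total_weight == 0:
        return None
    return sum(v * w for v, w in pairs) / total_weight

def weighted_mean_nulls_as_zero(values, weights):
    """The behavior described above: null values are counted as zeros."""
    filled = [0 if v is None else v for v in values]
    return weighted_mean(filled, weights)

values = [10.0, None, 30.0]
weights = [1.0, 2.0, 1.0]

print(weighted_mean(values, weights))                   # null row dropped
print(weighted_mean_nulls_as_zero(values, weights))     # null row counted as 0
print(weighted_mean(values, weights, skip_null=False))  # null propagates
```

With these three rows, dropping the null gives (10 + 30) / 2, while treating it as zero gives (10 + 0 + 30) / 4, so the two approaches disagree by a factor of two.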
This is also the case within the Linear Regression tool. If null values are not omitted prior to regression, the results are wildly different. Perhaps this is known by more experienced users/statisticians, but this incorrect usage would have gone on unbeknownst to me had I not cross-checked with RStudio.
Error: Cross Validation (58): Tool #4: Error in tab + laplace : non-numeric argument to binary operator
This is odd, because I see that there is special code that handles naive bayes models. Seems that the model$laplace parameter is _not_ null by the time it hits `update`. I'm not sure yet what line is triggering the error.
Some of the predictive tools put out a "Score" field when output is run through the scoring tool, and some put out a "Score_1" and/or "Score_0". Since I frequently reuse the same workflow template for different predictive model types, it would be nice if they were consistent so that I wouldn't have to crash the workflow the first time through to get the input field names correct for downstream tools (e.g., Sort). Thank you