I think the Nearest Neighbor Algorithm is one of the least used, and most powerful algorithms I know of. It allows me to connect data points with other data points that are similar. When something is unpredictable, or I simply don't have enough data, this allows me to compare one data point with its nearest neighbors.
So, last night I was at school, taking a graduate-level Econ course. We were discussing various distance measures for a nearest neighbor algorithm. Our prof discussed one called the Mahalanobis distance. It uses some fancy matrix algebra. Essentially, it filters out the noise and matches only on differences that are truly significant. It takes into account the correlation that may exist among variables, so a group of correlated variables effectively counts only once.
I use Nearest Neighbor when other things aren't working for me: when my data sets are weak, sparse, or otherwise not predictable. Sometimes I don't know that particular variables are correlated. This is a powerful distance measure that could be added into Nearest Neighbor to allow for matches that might not otherwise be found, and to match on only the variables that really matter.
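To make the idea concrete, here is a minimal sketch of a nearest neighbor lookup using the Mahalanobis distance. It is not how Alteryx does anything; it is restricted to two variables so the covariance matrix can be inverted by hand, and the function names are made up for illustration.

```python
from statistics import mean

def covariance_2d(points):
    """Sample covariance entries (sxx, sxy, syy) for a list of (x, y) points."""
    mx = mean(p[0] for p in points)
    my = mean(p[1] for p in points)
    n = len(points) - 1
    sxx = sum((p[0] - mx) ** 2 for p in points) / n
    syy = sum((p[1] - my) ** 2 for p in points) / n
    sxy = sum((p[0] - mx) * (p[1] - my) for p in points) / n
    return sxx, sxy, syy

def mahalanobis_2d(a, b, cov):
    """Mahalanobis distance between two 2-D points given (sxx, sxy, syy)."""
    sxx, sxy, syy = cov
    det = sxx * syy - sxy ** 2  # zero if the variables are perfectly correlated
    dx, dy = a[0] - b[0], a[1] - b[1]
    # d^2 = [dx dy] * inverse(S) * [dx dy]^T, with the 2x2 inverse written out
    d2 = (syy * dx * dx - 2 * sxy * dx * dy + sxx * dy * dy) / det
    return d2 ** 0.5

def nearest_neighbor(query, points, cov):
    """Return the point closest to query under the Mahalanobis distance."""
    return min(points, key=lambda p: mahalanobis_2d(query, p, cov))
```

With uncorrelated, unit-variance variables this reduces to the ordinary Euclidean distance. When the two variables are strongly correlated, a point that lies along the direction of correlation can be chosen over one that is closer in Euclidean terms but sits off that direction.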
A lot of popular machine learning systems use a computer's GPU to speed up some of the math to a huge degree. The header on this article on Medium shows a 15x difference from a high-end CPU vs a high-end GPU. It could also create an improvement in the spatial tools. Perhaps Alteryx should add this functionality in order to speed up these tools, which I can imagine are currently some of the slowest.
XGBoost regression is now the benchmark for every Kaggle competition and seems to consistently outperform random forest, spline regression, and all of the more basic models. For those of us using predictive modeling on a regular basis in our actual work, this tool would allow for a quick improvement in our model accuracy. And I think, from a marketing standpoint, having a core group of users competing in Kaggle using Alteryx would be a great way to show off Alteryx's power.
I checked out the "Boosted" model and see that it basically wraps the "gbm" model in R. I would like to request a similar wrapping for the newer xgb (or xgboost) -- eXtreme Gradient Boosting, which is very fast and accurate, and is winning Kaggle competitions left and right. It would be a great addition and is something SAS probably won't have for another 10 years, if ever.
This request is largely based on the implementation found on AzureML (take their free trial and check out the Deep Convolutional and Pooling NN example from their gallery). This allows you to specify custom convolutional and pooling layers in a deep neural network. This is an extremely powerful machine learning technique that could be tricky to implement, but a great first step could be, for example, a macro wrapped around something in Python, where these networks are currently easier to implement than in R.
It would be great if we could output the coefficients of a regression equation to a table so that they can be used in the rest of the module. Currently, Alteryx can output the coefficients only in chart/report form, which is not reusable in the module. The values of the coefficients, residuals, and errors would be very useful in building macros for techniques like Missing Value Analysis, which can't be done in Alteryx as of now.
I am not sure if this capability exists but I assume it does not.
We have a need to optimize a Linear Program (LP) model that consists of a system of equations and has both an objective function and a series of constraints. This optimization capability is something SAS offers that Alteryx currently does not.
If the capability is not currently available, is it on the Product Roadmap?
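For anyone unsure what such a model looks like, here is a toy sketch of an LP with an objective function and a set of constraints. The solver is a brute-force illustration I wrote for two variables only (it checks every corner of the feasible region and assumes the problem is bounded); the function name `solve_lp_2d` is hypothetical, and real work would use a proper solver such as `scipy.optimize.linprog` or a commercial optimizer.

```python
from itertools import combinations

def solve_lp_2d(c, constraints, eps=1e-9):
    """Maximize c[0]*x + c[1]*y subject to a*x + b*y <= rhs for each
    (a, b, rhs) in constraints, plus x >= 0 and y >= 0.

    Returns (value, x, y), or None if no feasible corner point exists.
    Illustration only: assumes the optimum is finite.
    """
    # Treat non-negativity as two more "<=" constraints.
    all_cons = list(constraints) + [(-1.0, 0.0, 0.0), (0.0, -1.0, 0.0)]
    best = None
    for (a1, b1, r1), (a2, b2, r2) in combinations(all_cons, 2):
        det = a1 * b2 - a2 * b1
        if abs(det) < eps:
            continue  # parallel boundaries never intersect
        # Intersection of the two boundary lines (Cramer's rule).
        x = (r1 * b2 - r2 * b1) / det
        y = (a1 * r2 - a2 * r1) / det
        # Keep the corner only if it satisfies every constraint.
        if all(a * x + b * y <= rhs + eps for a, b, rhs in all_cons):
            value = c[0] * x + c[1] * y
            if best is None or value > best[0]:
                best = (value, x, y)
    return best

# Objective: maximize 3x + 5y
# Constraints: x <= 4, 2y <= 12, 3x + 2y <= 18
result = solve_lp_2d((3, 5), [(1, 0, 4), (0, 2, 12), (3, 2, 18)])
```

The point is only to show the shape of the input a tool would need: one row of objective coefficients, and one row of coefficients plus a right-hand side per constraint.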
Hello! Almost all statistical software packages let the analyst choose either a pairwise or a listwise option when applying clustering techniques. This option affects only how the inner distance matrix is built; after that, whichever algorithm you choose is performed. However, [K-Centroids] in Alteryx does listwise by default, classifying only those records where the selected variables have no nulls.
Please consider adding this option!
PS: the difference is that pairwise builds each distance from only those records that have no nulls on both of the variables involved, while listwise builds the distance matrix only after checking for records that are completely non-null in all variables of interest (not a one-at-a-time distance calculation).
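Here is a small sketch of the two options, under one common reading of "pairwise": each distance uses only the fields that are non-null on both sides, then gets scaled up so that distances built from fewer fields stay comparable. The function names are made up, and other packages may define the pairwise option differently.

```python
import math

def listwise_rows(rows):
    """Listwise: keep only records with no nulls in any selected field."""
    return [r for r in rows if all(v is not None for v in r)]

def pairwise_distance(a, b):
    """Pairwise: Euclidean distance over the fields present in both records.

    The squared sum is rescaled by (total fields / shared fields) so a
    distance built from fewer fields is not artificially small.
    Returns None when the two records share no non-null field.
    """
    shared = [(x, y) for x, y in zip(a, b) if x is not None and y is not None]
    if not shared:
        return None
    scale = len(a) / len(shared)
    return math.sqrt(scale * sum((x - y) ** 2 for x, y in shared))

records = [(1, 2, 3), (1, None, 3), (4, 6, 3)]
complete = listwise_rows(records)              # drops the record with a null
d = pairwise_distance(records[1], records[2])  # still usable under pairwise
```

Under listwise the second record is never clustered at all; under pairwise it still gets a distance to every other record.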
I am trying to run batch regressions on a pretty sizable set of data: about 1M distinct groups of data, each with 30-500 x,y pairs.
A batch macro with a linear regression works OK, but it is really slow. It started at about 2-3s per regression. After stripping a bunch of reporting out of the macro, I am down to ~2s. This still feels quite slow compared to something purpose-built.
Has anyone experimented with higher speed versions that just dump out m,b, & r2?
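For a single predictor, m, b, and r2 have closed-form formulas that need only running sums, so no model object or report is required. A minimal sketch in plain Python, assuming the data arrives as (group, x, y) rows; the function names are made up for illustration.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = m*x + b, returning (m, b, r2)."""
    n = len(xs)
    sx, sy = sum(xs), sum(ys)
    sxx = sum(x * x for x in xs)
    syy = sum(y * y for y in ys)
    sxy = sum(x * y for x, y in zip(xs, ys))
    dxx = n * sxx - sx * sx  # n^2 times the variance of x
    dyy = n * syy - sy * sy  # n^2 times the variance of y
    dxy = n * sxy - sx * sy  # n^2 times the covariance of x and y
    if dxx == 0:
        raise ValueError("all x values are identical; slope is undefined")
    m = dxy / dxx
    b = (sy - m * sx) / n
    # For simple linear regression, r2 is the squared correlation.
    r2 = (dxy * dxy) / (dxx * dyy) if dyy else 1.0
    return m, b, r2

def fit_groups(rows):
    """rows: iterable of (group_key, x, y). Returns {group_key: (m, b, r2)}."""
    groups = {}
    for key, x, y in rows:
        xs, ys = groups.setdefault(key, ([], []))
        xs.append(x)
        ys.append(y)
    return {key: fit_line(xs, ys) for key, (xs, ys) in groups.items()}

rows = [("a", 0, 1), ("a", 1, 3), ("a", 2, 5),
        ("b", 0, 0), ("b", 1, 1), ("b", 2, 1)]
fits = fit_groups(rows)
```

The same sums could in principle be produced by a grouped summarize step followed by a formula, which avoids calling a regression routine once per group.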
1) either start separate "Alteryx-kaggle" instances with data sets specific to each kaggle competition, so that anyone who wants to try it out can have a go with those well-known examples through the Alteryx site,
2) or, even better, have a partnership with kaggle so that anyone can have their own Alteryx trial per specific competition on the kaggle website...
I'm sure this will draw a lot of attention...
You'll immediately have a greater reach in the Kaggle community: data hobbyists, CS students, and academics (who will eventually end up doing lots of data blending when they are hired by top-notch firms)...
In forecasting and in commercial/SME risk scoring there is a need to try a vast number of algebraic equations, which is a very cumbersome process. Let's add symbolic regression as a new competitive capability.
Summations, ratios, power transforms, and all combinations of the like need to be tested as new variables for a forecasting or prediction model. Doing this by hand is a tough and long business... And there is always a possibility of skipping a valuable combination.
Symbolic regression is a novel technique for automatically generating algebraic equations with the use of genetic programming. In every generation a variable is selected, and the equation is checked for how well it discriminates the target variable at hand. In each subsequent step, frequently observed variables are more likely to be selected.
Benefit for clients:
This method produces variables mainly with nonlinear relationships. It is a technique that will help in corporate/commercial/SME risk modelling, such that powerful risk models are generated from a short list of B/S and P/L based algebraic equations. There are potential use cases in algorithmic trading as well...
There are 3 very interesting world problems solved with symbolic regression here.
A very relevant thesis by sean Wouter is attached as a pdf document for your reading pleasure...
R side of things:
I've found Rgp package for genetic programming, here is a link.
The Tile tool, or the Multi-Field Binning tool (which does the same job as the Tile tool on multiple fields), splits the variables by 5 methods;
Equal Records or Intervals or Sums
Unfortunately, "equal something" binnings are a bad idea, as the values are categorized "blindly", irrespective of the effect on the predictive power of the models.
What to do:
What's needed is to bin both numerical and categorical variables optimally, such that the Weights of Evidence (WoE) present a monotone increasing or decreasing pattern, or at most a V- or U-shaped "convex" structure.
Without constraining ourselves to the monotone or convex cases, the easiest practice would be to run a C4.5 or CHAID tree algorithm (which produces multiple splits rather than the binary splits of CART) on a single variable, with the target as the dependent variable; the resulting nodes are the bins we are looking for. Doing this for multiple variables at once is the key to the tool to be built.
This capability is sought by risk management departments building robust, stable, Basel-compliant models in the financial industry, especially by banks.
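As a small illustration of the check described above, here is a sketch that computes the WoE per bin and tests whether the pattern is monotone. It assumes the convention WoE = ln(share of goods / share of bads), which some shops flip, and it takes the bins as already given (for example, the leaf nodes of a tree); the function names are made up.

```python
import math

def weight_of_evidence(bins):
    """bins: list of (good_count, bad_count) per bin, in bin order.

    Returns ln(share of goods / share of bads) for each bin.
    Bins with a zero count are not handled here and would need smoothing
    or merging with a neighbouring bin first.
    """
    total_good = sum(g for g, _ in bins)
    total_bad = sum(b for _, b in bins)
    return [math.log((g / total_good) / (b / total_bad)) for g, b in bins]

def is_monotone(values):
    """True if the sequence never changes direction."""
    increasing = all(a <= b for a, b in zip(values, values[1:]))
    decreasing = all(a >= b for a, b in zip(values, values[1:]))
    return increasing or decreasing

# Three bins of one variable, as (goods, bads)
woe = weight_of_evidence([(10, 40), (30, 30), (60, 30)])
ok = is_monotone(woe)
```

A binning tool could merge adjacent bins until `is_monotone` holds, which is one simple way to enforce the pattern after the tree has proposed its splits.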
When scoring data, if you have values in predictor fields that were not seen in the data used to build the model, the Score tool will not score the record. That makes sense, but it would be nice to know how impactful the issue is. Please provide a count of records not scored for this reason, a count of records not scored because they exceed the limit in the configuration tab of the Score tool, and a count for any other reason a record is not scored, so that we have a clear understanding of how many records were scored, how many were not, and why.
I have been using the outputs from Spline Regression to facilitate analysis of demographic data (specifically Department of Labor Quarterly Employment data). I have data from 1992Q1 to 2014Q1 and use Spline Regression to get fitted values for each quarter, with the predictors being the year/quarter, the year/quarter multiplied by a dummy variable for each of the 4 US Presidents, and a dummy variable for each president. That way I can compare results across various groupings by geographic and other levels, as well as by BLS aggregation level. I can analyze raw data or have the values to be fitted indexed to 1992Q1. I use the default settings for Spline, and it builds the best fit, including where the node periods fall for each spline section. To help interpret the results, though, I use the output to compare the actual vs. fitted values (e.g. employment level) and then look at the changes by quarter. With the spline regression building the best model with optimal line segments, the results make it possible to see how employment progress or regress correlates with presidential terms of office, or to see the specific impacts of economic recessions on employment data.
I can supply an example of the process, if anyone is interested.
I'd appreciate any comments and/or suggestions to improve the process or interpret the results.