We have discussed, on several occasions and in different forums, the importance of giving Alteryx order-of-execution control, conditional execution, design patterns, and even orchestration.
I presented this idea some time ago, but someone asked me if it was posted, and since it was not, I’m putting it here so you can give some feedback on it.
The basic concept behind this idea is to allow us (users) to have:
This approach involves some functionalities that are already within the product (like exploiting Filtering logic, loading & saving, caching, blocking among others), exposed within a Tool Container with enhanced attributes, like this example:
The approach is to extend Tool Container’s attributes.
This proposition uses actual functionalities we already have in Designer.
So, basically, the Tool Container gets ‘superpowers’ with the addition of some capabilities, like accepting input data, saving the contents of the container (to create a design pattern, or a very commonly used sequence of tools chained together), outputting data, running the tools included in the container, etc., plus a configuration screen like:
This should end a brief introduction to the idea, but taking it a little further, it would even allow something like an Orchestration layout, where users can drag and drop containers or patterns and orchestrate them in a solution, like we can do with the Visual Layout tool or the Interactive Chart tool:
I'm looking forward to hearing what you think.
This has probably been mentioned before, but in case it hasn't....
Right now, if the Dynamic Input tool skips a file (which it often does!), it just appears as a warning and processing continues. While it is still useful to be able to continue, could an option be built into the tool to 'error if files are skipped'?
Right now it is easy to miss that this is happening, and in production or on Server you may want the process to stop.
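To make the requested behaviour concrete, here is a minimal sketch in plain Python (not Alteryx code). It assumes a file is skipped when its header does not match an expected one. The function name `read_files` and the `error_if_skipped` flag are hypothetical and only illustrate the option described above.

```python
import csv

def read_files(paths, expected_header, error_if_skipped=False):
    """Read CSV files, skipping any whose header differs from expected_header.

    error_if_skipped is a hypothetical flag mirroring the requested option:
    when True, skipped files raise an error instead of passing silently.
    """
    rows, skipped = [], []
    for path in paths:
        with open(path, newline="") as handle:
            reader = csv.reader(handle)
            header = next(reader, None)
            if header != expected_header:
                # Default behaviour: remember the file and keep going.
                skipped.append(str(path))
                continue
            rows.extend(reader)
    if skipped and error_if_skipped:
        raise RuntimeError("Files skipped: " + ", ".join(skipped))
    return rows, skipped
```

With the flag off, the caller still gets the list of skipped files back, so a warning can be shown. With it on, the run stops, which is what you would want in production.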
I surprisingly couldn't find this anywhere else as I know it's been discussed in person on many occasions.
Basically the Formula tool needs to be smarter in many ways, but this particular post focuses on the Data Type component.
The Formula tool should not always default to V_String as the data type when data or a formula is entered; it should look at the data and estimate the most likely type.
I know there are times where the logical type might not be consistent in all fields, but the Data Preview and the Function of the formula should be used to determine the most likely option.
E.g., if I type a number or a date directly into the Formula tool, then Alteryx should be smart enough to change the data type from the standard V_String to Int, Double or Date.
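As a rough illustration of the kind of guess being asked for, here is a minimal Python sketch (not how Designer works internally). The function name `infer_type` and the single `YYYY-MM-DD` date format are assumptions; the type names are the ones mentioned above.

```python
from datetime import datetime

def infer_type(literal: str) -> str:
    """Guess a data type name for a value typed into an expression box."""
    text = literal.strip()
    try:
        int(text)
        return "Int"
    except ValueError:
        pass
    try:
        float(text)
        return "Double"
    except ValueError:
        pass
    try:
        # Assumption: dates are entered as YYYY-MM-DD.
        datetime.strptime(text, "%Y-%m-%d")
        return "Date"
    except ValueError:
        pass
    # Fall back to the current default when nothing more specific fits.
    return "V_String"
```

The order matters: the narrowest type is tried first, and V_String stays as the fallback rather than the starting point.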
This is an extension to the ideas posted here:
I often need to create a record ID that automatically increments but is grouped by a specific field. I currently do it using the Multi-Row Formula tool with [Row-1:ID]+1 because there is no group by option in the Record ID tool.
Also, sometimes I need to start at 0 but the Multi-Row Formula tool doesn't allow this so I have to use a Formula tool right after to subtract 1.
So adding a group by option to the Record ID tool would allow the user not to use the multi-row formula to do this and to start at any value wanted.
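For reference, the requested behaviour can be sketched in a few lines of plain Python. This is only an illustration: `grouped_record_ids`, the `ID` output field, and the `start` parameter are hypothetical names, not part of any Alteryx tool.

```python
from collections import defaultdict

def grouped_record_ids(rows, group_field, start=1):
    """Add an 'ID' field that counts up from `start` within each group."""
    # Each new group value gets its own counter, initialised to `start`.
    counters = defaultdict(lambda: start)
    numbered = []
    for row in rows:
        key = row[group_field]
        numbered.append({**row, "ID": counters[key]})
        counters[key] += 1
    return numbered
```

Calling it with `start=0` gives zero-based IDs per group, with no follow-up step needed to subtract 1.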
Love the new updates to the Browse tool in 2019.2! However, if you choose the option Open results in new window, which I do often so I can see my whole dataset, the search/filter/sort functionality goes away. Would be great if that new functionality also worked in the new window. Thanks!
Can't wait for the new base maps!
In app screens, a lot of space is wasted because components/tools can only be stacked one below the other.
It would be great if we could also insert them horizontally.
Tags : screen, app, macro, layout, tools, UI
I am recommending/requesting that Alteryx add an XGBoost tool written in R to the Predictive toolset. I have just finished productionizing a classification model using the Boosted tool and while it performs well, I derived better predictions using XGBoost (in another software on the same data). I am aware that the Intelligence Suite now has XGB in it...but that is at an additional cost and, quite frankly, after having tested it, more difficult to productionize.
A couple of points:
1. Every other toolset that I have used has an XGBoost algorithm as part of the standard package (SPSS Modeler, Statistica, RapidMiner);
2. XGBoost is arguably the leading algorithm out there for many/most classification problems; it has been in the winning solution in a disproportionate number of Kaggle competitions (40%-50%?);
3. It is a bit gut-wrenching to tell my colleagues, "No, there is not a supported XGB tool in Alteryx" when that is seen as a litmus test for a DS platform or tool.
4. Yes, we can use R or Python and that is precisely what I will do in iteration 2 of the model...but having the tool already exist would save significant time, especially when building and testing models.
Thank you for the consideration!
I would like to request that the Python tool metadata either be automatically populated after the code has run once, or a simple line of code added in the tool to output the metadata. Also, the metadata needs to be cached just like all of the other tools.
As it sits now, the Python tool is nearly unusable in a larger workflow. This is because it does not save or pass metadata in a workflow. Most other tools cache temporary metadata and pass it on to the next tool in line. This allows for things like selecting columns and seeing previews before the workflow is run.
Each time an edit is made to the workflow, the workflow must be re-run to update everything downstream of the Python tool. As you can imagine, this can get tedious (unusable) in larger workflows.
Alteryx support has replied with "this is expected behavior" and "It is giving that error because Alteryx is doing a soft push for the metadata but unfortunately it is as designed."
I'm really liking the new assisted modelling capabilities released in 2020.2, but it should not error if the data contains spatial, blob, date, time, or datetime types.
This is essentially telling the user to add an extra step of adding a select before the assisted modelling tool and then a join after the models. I think the tool should be able to read in and through these field types (especially dates) and just not use them in any of the modelling.
An even better enhancement would be to transform date as part of the assisted modelling into something usable for the modelling (season, month, day of week, etc.)
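A minimal sketch of what such a date transformation could look like, in plain Python. The function name `date_features` is hypothetical, and the season mapping assumes northern-hemisphere meteorological seasons (December to February is winter).

```python
from datetime import date

DAY_NAMES = ["Monday", "Tuesday", "Wednesday", "Thursday",
             "Friday", "Saturday", "Sunday"]
# Assumption: meteorological seasons, northern hemisphere.
SEASONS = ["winter", "spring", "summer", "autumn"]

def date_features(d: date) -> dict:
    """Derive model-friendly fields from a single date."""
    return {
        "month": d.month,
        "day_of_week": DAY_NAMES[d.weekday()],
        # Dec/Jan/Feb -> 0, Mar-May -> 1, Jun-Aug -> 2, Sep-Nov -> 3
        "season": SEASONS[(d.month % 12) // 3],
    }
```

Each derived field is categorical or a small integer, so it can go into a model where a raw date cannot.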
Sometimes, as a sanity check, I would like to be able to model only the mean of my data set, i.e. I would like to use a predictive tool with no predictors included. The result would be a model with only an intercept, and this value would be the mean of the target variable. This would not be an important feature for final models, of course, but when starting to look at a data set and build up a model, it can be useful to first ensure the model is producing the expected output in the simplest case.
Note, this can be achieved when just one predictor is included, but it takes some math (see below), so it would be nice to be able to have this as a built-in option.
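The math referred to here did not come through with the post. As a stand-in, and not necessarily the author's own derivation, the standard least-squares identity is that the fitted line passes through the point of means, so intercept + slope × mean(x) equals mean(y). A small plain-Python sketch, where the helper `ols_fit` and the sample numbers are made up for illustration:

```python
from statistics import mean

def ols_fit(x, y):
    """Ordinary least squares for one predictor; returns (intercept, slope)."""
    x_bar, y_bar = mean(x), mean(y)
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    return intercept, slope

# Illustrative data only.
x = [1.0, 2.0, 3.0, 4.0]
y = [2.0, 4.0, 5.0, 9.0]

intercept, slope = ols_fit(x, y)
# Evaluating the fitted line at the mean of x recovers the mean of y,
# which is what an intercept-only model would report directly.
recovered_mean = intercept + slope * mean(x)
```

An intercept-only option would skip this step, because the single fitted coefficient would already be the mean of the target.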
Unsupervised learning method to detect topics in a text document.
Helpful for users interested in text mining.
This idea arose recently when working specifically with the Association Analysis tool, but I have a feeling that other predictive tools could benefit as well. I was trying to run an association analysis for a large number of variables, but when I was investigating the output using the new interactive tools, I was presented with something similar to this:
While the correlation plot draws your eye to the high associations, the user is unable to read the field names, and the tooltip only provides the correlation value rather than the fields along with the value. As such, I shifted my attention to the report output, which looked like this:
While I could now read everything, it made pulling out the insights much more difficult. Wanting the best of both worlds, I decided to extract the correlation table from the R output and drop it into Tableau for a filterable, interactive version of the correlation matrix. This turned out to be much easier said than done. Because the R output comes in report form, I tried to use the report extract macros mentioned in this thread to pull out the actual values. This was an issue due to the report formatting, so instead I cracked open the macro to extract the data directly from the R output. To make a long story shorter, this ended up being problematic due to report formats, batch macro pathing, and an unidentifiable bug.
In the end, it would be great if there was a “Data” output for reports from certain predictive tools that would benefit from further analysis. While the reports and interactive outputs are great for ingesting small model outputs, at times there is a need to extract the data itself for further analysis/visualization. This is one example, as are the model coefficients from regression analyses that I have used in the past. I know Dr. Dan created a model coefficients macro for the case of regression, but I have to imagine that there are other cases where the data is desired along with the report/interactive output.
Would be extremely useful if the Summarize Tool had an option in the numeric menu to Standardize the data. More often than not, data sets will not have the same count of variables which makes the comparison analysis meaningless. Currently, there is no easy way to Standardize the data without using the K-Centroids Cluster Analysis tool or standardize_unit interval supporting macro.
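For clarity, "standardize" is taken here to mean z-scores: subtract the mean and divide by the standard deviation. A minimal plain-Python sketch follows. The function name is hypothetical, and using the population standard deviation is an assumption; a built-in option might use the sample version instead.

```python
from statistics import mean, pstdev

def standardize(values):
    """Return z-scores: (value - mean) / population standard deviation."""
    mu = mean(values)
    sigma = pstdev(values)
    if sigma == 0:
        # A constant field has no spread, so z-scores are undefined.
        raise ValueError("cannot standardize a field with zero variance")
    return [(v - mu) / sigma for v in values]
```

After this step every field has mean 0 and standard deviation 1, so fields measured on different scales can be compared directly.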
So - with Challenge 111 - many folk used the Optimization tool
… and Joe has done a great training on this here
But it's still too hard to use. It requires you to know a bunch of parameters in advance and to have several different types of knowledge.
Can we improve the interface on this tool so that it can be used by folk who do not have a background in R - for example, take all the different inputs, and make them parameterized on drop-down boxes or input boxes on the tool?
Thank you all
Python pandas dataframes and data types (numpy arrays, lists, dictionaries, etc.) are much more robust in general than their counterparts in R, and they play together much more easily as well. Moreover, there are only a handful of packages that do everything a data scientist would need, including graphing, such as SciKit Learn, Pandas, Numpy, and Seaborn. After utilizing R, Python, and Alteryx, I'm still a big proponent of integrating with the Python language much like Alteryx has integrated with R. At the very least, I propose adding the ability to write custom code, such as a Python tool.
It would be nice if this option would take you to the correct download page relative to the version the user has installed. Currently, this always loads the download page for the current version which is confusing for users of a company who are still required to use an older version.
When working with R code and errors occur, the application needs to show which line the error happened on.
Up to version 10.0 I could open pretty much all analytics tools as a macro, to tweak things in R or in the macro workflow to get the results in a way most useful to us.
But apparently with Alteryx 11.0 the newer tools do not have that option. We can still access the older versions of those tools and open them as macros, but I don't understand why this is being removed in the newer versions (maybe because they have the interactive report option?).
Most of the newer versions have new features (Linear Regression now supports elastic net and cross-validation, etc.), but I still want to be able to go into them to tweak them.
Designer should support statistical testing tools that ignore data distribution and support Statistical Learning methods.
Alteryx already supports resampling for predictive modeling with Cross-Validation.
Resampling tools for bootstrap and permutation tests (supporting with or without replacement) should be tools for analysts and data scientists alike that assess random variability in a statistic without needing to worry about the restrictions of the data's distribution, as is the case with many parametric tests, most commonly supported by the t-test Tool in Alteryx. With modern computing power the need for hundred-year-old statistical sampling testing is fading: the power to sample a data set thousands of times to compare results to random chance is much easier today.
The tool's results could include, like R, outputs of not only the results histogram but the associated Q-Q plot that visualizes the distribution of the data for the analyst. This would duplicate the Distribution Analysis tool somewhat, but the Q-Q plot is, to me, a major missing element in the simplest visualization of data. This tool could be very valuable in terms of feeding the A/B Test tools.
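To show how little machinery these tests need, here is a minimal plain-Python sketch of a two-sample permutation test (resampling without replacement) and a percentile bootstrap interval for the mean (resampling with replacement). The function names, defaults, and fixed seeds are assumptions for illustration, not a proposed tool design.

```python
import random
from statistics import mean

def permutation_test(a, b, n_resamples=10_000, seed=0):
    """Two-sided permutation p-value for a difference in means."""
    rng = random.Random(seed)
    observed = abs(mean(a) - mean(b))
    pooled = list(a) + list(b)
    hits = 0
    for _ in range(n_resamples):
        # Shuffling reassigns group labels without replacement.
        rng.shuffle(pooled)
        diff = abs(mean(pooled[:len(a)]) - mean(pooled[len(a):]))
        if diff >= observed:
            hits += 1
    # Add-one correction so the p-value is never exactly zero.
    return (hits + 1) / (n_resamples + 1)

def bootstrap_ci(sample, n_resamples=10_000, alpha=0.05, seed=0):
    """Percentile bootstrap confidence interval for the mean."""
    rng = random.Random(seed)
    means = sorted(
        mean(rng.choices(sample, k=len(sample)))  # draw with replacement
        for _ in range(n_resamples)
    )
    lower = means[int((alpha / 2) * n_resamples)]
    upper = means[int((1 - alpha / 2) * n_resamples) - 1]
    return lower, upper
```

Neither function assumes anything about the shape of the data's distribution, which is the point made above.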
XGBoost regression is now the benchmark for every Kaggle competition and seems to consistently outperform random forest, spline regression, and all of the more basic models. For those of us using predictive modeling on a regular basis in our actual work, this tool would allow for a quick improvement in our model accuracy. And I think, from a marketing standpoint, having a core group of users competing in Kaggle using Alteryx would be a great way to show off Alteryx's power.
It is readily available as an R package: https://cran.r-project.org/web/packages/xgboost/index.html
I think the Nearest Neighbor Algorithm is one of the least used, and most powerful algorithms I know of. It allows me to connect data points with other data points that are similar. When something is unpredictable, or I simply don't have enough data, this allows me to compare one data point with its nearest neighbors.
So, last night I was at school, taking a graduate level Econ course. We were discussing various distance measures for a nearest neighbor algorithm. Our prof discussed one called the Mahalanobis distance. It uses some fancy matrix algebra. Essentially, it filters out the noise and matches only on differences that are truly significant. It takes into account the correlation that may exist between variables, so that a group of strongly correlated variables effectively counts only once.
I use Nearest Neighbor when other things aren't working for me. When my data sets are weak, sparse, or otherwise not predictable. Sometimes I don't know that particular variables are correlated. This is a powerful algorithm that could be added into the Nearest Neighbor, to allow for matches that might not otherwise be found. And allow matches on only the variables that really matter.
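For anyone curious about the matrix algebra, the squared Mahalanobis distance between points p and q is (p − q)ᵀ S⁻¹ (p − q), where S is the covariance matrix of the data. Below is a minimal plain-Python sketch for the two-variable case only, so the 2×2 inverse can be written out by hand. The function name `mahalanobis_2d` is hypothetical, and a real tool would handle any number of variables.

```python
from math import sqrt
from statistics import mean

def mahalanobis_2d(p, q, data):
    """Mahalanobis distance between 2-D points p and q, given reference data."""
    xs = [row[0] for row in data]
    ys = [row[1] for row in data]
    mx, my = mean(xs), mean(ys)
    n = len(data)
    # Sample covariance matrix S = [[sxx, sxy], [sxy, syy]].
    sxx = sum((x - mx) ** 2 for x in xs) / (n - 1)
    syy = sum((y - my) ** 2 for y in ys) / (n - 1)
    sxy = sum((x - mx) * (y - my) for x, y in data) / (n - 1)
    det = sxx * syy - sxy ** 2
    if det <= 0:
        raise ValueError("covariance matrix is singular")
    dx, dy = p[0] - q[0], p[1] - q[1]
    # (dx, dy) S^-1 (dx, dy)^T with the 2x2 inverse expanded.
    d2 = (syy * dx * dx - 2 * sxy * dx * dy + sxx * dy * dy) / det
    return sqrt(d2)
```

When the two variables are strongly correlated, a step along the direction of the correlation comes out much shorter than an equally long step across it. That is how correlated variables end up not being double counted.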
I checked out the "Boosted" model and see that it basically wraps the "gbm" model in R. I would like to request a similar wrapping for the newer xgb (or xgboost) -- eXtreme Gradient Boosting, which is very fast and accurate, and is winning Kaggle competitions left and right. It would be a great addition and is something SAS probably won't have for another 10 years, if ever.
I would like to share some feedback regarding the Principal Component tool.
I've selected the option "Scale each field to have unit variance" and 1 of the 4 PCA tools was displaying errors. However, the error message is not very intuitive and I couldn't use it to debug my workflow. The problem was that for my type of data, scaling could not be applied since it had a lot of 0 values.
Couldn't find anything related to this, so hope my feedback helps others.
Similar to @aselameab1 - I was having trouble with using the Linear regression tool because it was giving error messages that were not explanatory or self descriptive.
@chadanaber identified the issue - that a specific field only had one unique value which was causing the regression tool to fail - however the error message provided gives no useful or helpful indication that this is the issue. You can see that the error message below is pretty tough to understand.
Could we add an item to the development backlog to add defensive checks to the predictive analytics tools to check for conditions that will cause them to fail, and rework the error messaging?
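As an example of the kind of defensive check meant here, a plain-Python sketch that looks for fields with only one unique value before any model is fitted. The function names and the error wording are hypothetical and are not taken from any Alteryx tool.

```python
def find_constant_fields(rows, fields):
    """Return the fields that hold at most one distinct value."""
    constant = []
    for field in fields:
        distinct = {row[field] for row in rows}
        if len(distinct) <= 1:
            constant.append(field)
    return constant

def validate_predictors(rows, fields):
    """Fail early, with a readable message, if any predictor is constant."""
    bad = find_constant_fields(rows, fields)
    if bad:
        raise ValueError(
            "Field(s) with only one unique value cannot be used as "
            "predictors: " + ", ".join(bad)
        )
```

Run before the regression itself, a check like this names the offending field directly instead of surfacing a low-level error.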
I've attached the workflow with the sample data that replicates this issue
A lot of popular machine learning systems use a computer's GPU to speed up some of the math to a huge degree. The header on this article on Medium shows a 15x difference from a high-end CPU vs a high-end GPU. It could also create an improvement in the spatial tools. Perhaps Alteryx should add this functionality in order to speed up these tools, which I can imagine are currently some of the slowest.