Many software & hardware companies take a very quantitative approach to driving their product innovation: they measure how the product handles a standard baseline of tasks today, measure how the new version handles the same tasks, and show the improvement over time.
- Database vendors have been doing this for years using TPC benchmarks (http://www.tpc.org/), where a FIXED set of tasks is agreed as a benchmark and the vendors then iterate year over year to improve performance against it
- Graphics card companies or GPU companies have used benchmarks for years (e.g. TimeSpy, Cinebench etc).
How could this translate for Alteryx?
- Every year at Inspire - we hear the stats that say that 90-95% of the time taken is data preparation
- We also know that the reason for buying Alteryx is to reduce the time & skill level required to achieve these outcomes - again, as reinforced by the message that we're driving towards self-service analytics & citizen data analytics.
Wouldn't it be great if Alteryx could say: "In the 2019.3 release - we have taken 10% off the benchmark of common tasks as measured by time taken to complete" - and show a 25% reduction year over year in the time to complete this battery of data preparation tasks?
One proposed method:
What would this give Alteryx?
This could be very simple to administer; and if done well it could give Alteryx:
- A clear and unambiguous marketing message that they are super-focussed on solving for the 90-95% of your time that is NOT being spent on analytics, but rather on data prep
- It would also provide focus to drive the platform in the direction of the biggest pain points - all the teams across the platform can then rally around a really deep focus on the user and accelerating their "time from raw data to analytics".
- A competitive differentiation - invite your competitors to take part too just like TPC.org or any of the other benchmarks
What this is / is NOT:
Loads of ways that this could be administered - starting point is to agree to drive this quantitatively on a fixed benchmark of tasks and data
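As a rough illustration of the "percentage off the benchmark" arithmetic, here is a minimal Python sketch. The task names and timings are made up purely for illustration (they are not real Alteryx measurements), and the function name is hypothetical.

```python
def percent_reduction(baseline, current):
    """Percentage reduction in time from baseline to current (positive = faster)."""
    return 100.0 * (baseline - current) / baseline

# Hypothetical minutes-to-complete for a fixed battery of data prep tasks.
baseline_release = {"parse dates": 12.0, "join two sources": 20.0, "clean text": 8.0}
new_release = {"parse dates": 10.0, "join two sources": 15.0, "clean text": 5.0}

# Per-task improvement, then the headline number for the whole battery.
per_task = {
    task: percent_reduction(baseline_release[task], new_release[task])
    for task in baseline_release
}
overall = percent_reduction(sum(baseline_release.values()), sum(new_release.values()))

for task, pct in per_task.items():
    print(f"{task}: {pct:.1f}% faster")
print(f"whole battery: {overall:.1f}% faster")
```

The key point is that the task list and the data stay fixed between releases, so the only thing that changes is the product.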
I would like to see more file types supported for dragging from a folder onto a workflow, specifically .txt and .dat files. This would greatly help my team and me analyze the new and unknown data files that we receive on a daily basis.
Dear GUI Gurus,
A minor, but time saving GUI enhancement would be appreciated. When adding a tool to the canvas, the current behavior is to make visible the tool anchor that was last used on prior tools. That being said, when I look at the results window, I might be adding a "vanilla" configuration tool to the canvas and stare at a BLANK results window. When users are adding tools to the canvas, I suggest that the best practice is to VIEW the incoming data before configuring the tool.
I ALWAYS set the results to view the INCOMING DATA ANCHOR.
This minor change would be welcome to me.
Browse tool is really a powerful tool. We can see all information regarding datasets very rapidly.
Unfortunately, we can only export information (graphs, tables) manually as PNG files...
One major use of Alteryx in big companies is to perform Data Quality reviews.
If we could export Browse tool information (graphs, tables) automatically to a PDF file or another format, we could save a lot of time on Data Quality tasks.
Currently the only alternative is to use a DataViz tool or set up a specific render in Alteryx (very time-consuming).
Main benefit would be the ability to share insights of DATA Quality with other business units.
Unsupervised learning method to detect topics in a text document.
Helpful for users interested in text mining.
Similar to the Select tool's Unknown Field checkbox, I figured it would be useful for the Data Cleansing tool to have this functionality as well. It would avoid the scenario where a cross-tab produces new numeric fields, one of which contains a Null value, so you can't total up multiple fields because the Null prevents the addition from happening. If the Unknown Field box were checked in the Data Cleansing tool, this problem would be avoided.
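To show the behaviour I mean outside of Alteryx, here is a small Python sketch. It is only an analogy: rows are plain dictionaries, `None` stands in for Null, and the helper names are hypothetical. Any numeric field, including one that was not known when the cleansing step was configured, gets its Nulls replaced so the total still works.

```python
def numeric_fields(rows):
    """Names of fields that hold at least one int/float value in any row."""
    fields = set()
    for row in rows:
        for name, value in row.items():
            if isinstance(value, (int, float)) and not isinstance(value, bool):
                fields.add(name)
    return fields

def fill_null_numerics(rows, default=0):
    """Replace None with `default` in every numeric field, known in advance or not."""
    nums = numeric_fields(rows)
    return [
        {name: (default if value is None and name in nums else value)
         for name, value in row.items()}
        for row in rows
    ]

# Hypothetical cross-tab output: Q3 is a field that appeared after configuration.
rows = [
    {"Region": "East", "Q1": 10, "Q2": None, "Q3": 4},
    {"Region": "West", "Q1": 5, "Q2": 7, "Q3": None},
]
cleaned = fill_null_numerics(rows)
totals = [sum(v for k, v in row.items() if k != "Region") for row in cleaned]
print(totals)
```

Without the fill step, adding `None` to a number raises an error in Python, which mirrors the Null blocking the addition in the workflow.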
We don't have a separate ANOVA tool in Alteryx. Can you think of any reason why?
It's not raw data or row blended data but insights gathered that's important:
Linear Regression Tool has a report for Type II ANOVA based on the model table we provide.
But both type II and other types are not available as standalone statistics tools...
Here is a list of the different types of ANOVA that may be useful:
|ANOVA models||Definitions|
|t-tests||Comparison of means between two groups; if independent groups, then independent samples t-test. If not independent, then paired samples t-test. If comparing one group against a fixed value, then a one-sample t-test.|
|One-way ANOVA||Comparison of means of three or more independent groups.|
|One-way repeated measures ANOVA||Comparison of means of three or more within-subject variables.|
|Factorial ANOVA||Comparison of cell means for two or more between-subject IVs.|
|Mixed ANOVA||Comparison of cell means for one or more between-subjects IV and one or more within-subjects IV.|
|ANCOVA||Any ANOVA model with a covariate.|
|MANOVA||Any ANOVA model with multiple DVs. Provides omnibus F and separate Fs.|
Looking forward to the addition of ANOVA tools to the Data Investigation toolbox...
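For reference, the one-way ANOVA row in the table above boils down to a fairly small calculation. Below is a minimal pure-Python sketch of the F statistic (between-group mean square divided by within-group mean square). The function name and the sample groups are illustrative only, and it does not compute the p-value, which needs the F distribution.

```python
def one_way_anova_f(groups):
    """Return (F, df_between, df_within) for a list of independent groups."""
    k = len(groups)
    n = sum(len(g) for g in groups)
    grand_mean = sum(sum(g) for g in groups) / n
    group_means = [sum(g) / len(g) for g in groups]

    # Variation of group means around the grand mean, weighted by group size.
    ss_between = sum(len(g) * (m - grand_mean) ** 2 for g, m in zip(groups, group_means))
    # Variation of observations around their own group mean.
    ss_within = sum((x - m) ** 2 for g, m in zip(groups, group_means) for x in g)

    df_between = k - 1
    df_within = n - k
    f_stat = (ss_between / df_between) / (ss_within / df_within)
    return f_stat, df_between, df_within

# Made-up example data: three independent groups.
f_stat, df_b, df_w = one_way_anova_f([[1, 2, 3], [2, 3, 4], [5, 6, 7]])
print(f_stat, df_b, df_w)
```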
On 2019.2.5.62427, the interactive results grid is only available in the embedded results window, not when you open the results in a new window ('Open results in New Window' -> New Window).
The interactive grid also appears to be unavailable if you double-click a yxdb file to open it and view the content.
Would be useful to have the interactive grid in both these areas instead of just the embedded result window.
This idea arose recently when working specifically with the Association Analysis tool, but I have a feeling that other predictive tools could benefit as well. I was trying to run an association analysis for a large number of variables, but when I was investigating the output using the new interactive tools, I was presented with something similar to this:
While the correlation plot draws your eye to the high associations, the user is unable to read the field names, and the tooltip only provides the correlation value rather than the fields along with the value. As such, I shifted my attention to the report output, which looked like this:
While I could now read everything, it made pulling out the insights much more difficult. Wanting the best of both worlds, I decided to extract the correlation table from the R output and drop it into Tableau for a filterable, interactive version of the correlation matrix. This turned out to be much easier said than done. Because the R output comes in report form, I tried to use the report extract macros mentioned in this thread to pull out the actual values. This was an issue due to the report formatting, so instead I cracked open the macro to extract the data directly from the R output. To make a long story shorter, this ended up being problematic due to report formats, batch macro pathing, and an unidentifiable bug.
In the end, it would be great if there was a “Data” output for reports from certain predictive tools that would benefit from further analysis. While the reports and interactive outputs are great for ingesting small model outputs, at times there is a need to extract the data itself for further analysis/visualization. This is one example, as is the model coefficients from regression analyses that I have used in the past. I know Dr. Dan created a model coefficients macro for the case of regression, but I have to imagine that there are other cases where the data is desired along with the report/interactive output.
The sum function is probably the one I use most in the summarize tool. It is a silly thing, but it would be nice for "Sum" to be in the single-click list, rather than in the "Numeric" category...
Right now - if a tool generates an error - there is nothing productive that you can do with the error rows. They are just sent to the error log and, depending on your settings, the entire canvas will fail.
Could we change this in the Designer to work more like SSIS - where almost every tool has an error output, so that you can send the good rows one way, and the error rows the other way, and then continue processing? The error rows can be sent to an error table or workflow or data-quality service; and the good rows can be sent onwards. Because you have access to the error rows, you can also do run stats of "successful rows vs. unsuccessful"
This would make a big difference in the velocity of developing a canvas or prepping data.
Can take some screenshots if that helps?
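In the meantime, here is a rough Python sketch of the pattern I mean. It is not how SSIS or Alteryx implement anything; the function and field names are hypothetical. A transform is applied row by row, good rows go one way, failing rows go the other way along with the error message, and processing continues.

```python
def split_rows(rows, transform):
    """Apply `transform` to each row; return (good_rows, error_rows)."""
    good, errors = [], []
    for row in rows:
        try:
            good.append(transform(row))
        except (ValueError, TypeError, KeyError) as exc:
            # Keep the original row plus the reason it failed.
            errors.append({"row": row, "error": str(exc)})
    return good, errors

def parse_amount(row):
    """Example transform: convert the 'amount' field from text to a number."""
    return {**row, "amount": float(row["amount"])}

rows = [
    {"id": 1, "amount": "10.5"},
    {"id": 2, "amount": "n/a"},   # this row cannot be converted
    {"id": 3, "amount": "7"},
]
good, errors = split_rows(rows, parse_amount)
print(f"{len(good)} successful rows, {len(errors)} unsuccessful rows")
```

The error rows could then be written to an error table or passed to a data-quality process, while the good rows carry on down the main stream.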
With large tables it is tedious to search for a field. It would be a great efficiency gain to allow a user to search for a column in a table by entering a name or partial column name.
Often in larger workflows, I will copy data from partway down the stream into a new workflow to troubleshoot a small section, so that I don't have to run the whole workflow over and over again, which can take a while. I'm aware (and thankful) of caching, but sometimes, if there are many parallel streams, I'd rather just copy the data from the data preview built into the tool so I don't have to take the time to run the workflow again. I'm also aware I could output a yxdb file and use that, but again that takes longer than I would like.
The issue I run into is if I copy the data and paste in a text input tool, all the field types change to what they would default to. This is fine with new data, but for data that has specific fields throughout the workflow, this can be a hassle. If copying data could also copy the field type and size that would be great.
Python pandas dataframes and data types (numpy arrays, lists, dictionaries, etc.) are much more robust in general than their counterparts in R, and they play together much more easily as well. Moreover, only a handful of packages, such as SciKit Learn, Pandas, Numpy, and Seaborn, cover everything a data scientist would need, including graphing. After utilizing R, Python, and Alteryx, I'm still a big proponent of integrating with the Python language much like Alteryx has integrated with R. At the very least, I propose adding the ability to write custom code, such as a Python tool.
One of the most common data-investigation tasks we have to do is comparing 2 data-sets. This may be making sure the columns are the same, checking that field names match, or even looking at row data. I think that this would be a tremendous addition to the core toolset. I've made a fairly good start on it, and am more than happy if you want to take this and extend or add to it (I give this freely with no claim on the work).
Very very happy to work with the team to build this out if it's useful
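To make the kind of comparison concrete, here is a minimal Python sketch. It is not the macro mentioned above; the function name, the output keys, and the sample data are all hypothetical. It reports columns that exist on only one side, then compares rows on the shared columns.

```python
from collections import Counter

def compare_datasets(left, right):
    """Compare two lists of row dictionaries on column names and row content."""
    left_cols = {name for row in left for name in row}
    right_cols = {name for row in right for name in row}
    shared = sorted(left_cols & right_cols)

    def as_key(row):
        # Project a row onto the shared columns so rows are comparable.
        return tuple(row.get(col) for col in shared)

    # Counters keep duplicates, so repeated rows are compared by count.
    left_rows = Counter(as_key(r) for r in left)
    right_rows = Counter(as_key(r) for r in right)

    return {
        "shared_columns": shared,
        "columns_only_in_left": sorted(left_cols - right_cols),
        "columns_only_in_right": sorted(right_cols - left_cols),
        "rows_only_in_left": list((left_rows - right_rows).elements()),
        "rows_only_in_right": list((right_rows - left_rows).elements()),
    }

left = [{"id": 1, "name": "Ann", "city": "Leeds"}, {"id": 2, "name": "Bob", "city": "York"}]
right = [{"id": 1, "name": "Ann", "region": "N"}, {"id": 2, "name": "Rob", "region": "N"}]
report = compare_datasets(left, right)
print(report)
```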
I wasted a good old chunk of time dealing with non-breaking spaces, and Alteryx could be improved by handling this automatically.
A space is a space, right? Nope, there are spaces (ASCII value decimal 32) and there are non-breaking spaces (character code decimal 160, which is U+00A0 in Latin-1/Unicode and strictly speaking outside 7-bit ASCII). They look the same, but have slightly different behaviour in certain circumstances, like when text is auto-wrapped.
The DataCleansing tool cleans spaces, but leaves non-breaking spaces.
The Data Grid puts a warning on cells with leading or trailing spaces, but remains silent for non-breaking spaces.
I was trying to match two strings that looked identical. I had DataCleansed my cells, and the grid was showing me nothing wrong with the data. In desperation, I copied the two data cells that I expected to match to a text editor (Textpad), and then examined the character codes of the data. One cell had a trailing non-breaking space, and that caused the failure to match.
This was hard to find. For someone less hopelessly nerdy, it would be practically impossible.
As a small change, it might be really useful for Alteryx to include non-breaking spaces in its definition of "space", such that the DataCleansing tool removes them, and the Data Grid flags up the cell as having a leading or trailing space.
You could pick up non-breaking spaces from HTML, or from Excel. I think mine came from a SQL script but I am not sure how it was there. They are out there, and they will bite.
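For anyone who wants to see the problem outside Alteryx, here is a small Python sketch. The helper names are my own, not anything from the product. It shows two strings that look identical but fail to match, finds the non-breaking space, and normalizes it to an ordinary space before trimming.

```python
NBSP = "\u00a0"  # non-breaking space, character code 160

def find_nbsp(text):
    """Return the positions of any non-breaking spaces in the text."""
    return [i for i, ch in enumerate(text) if ch == NBSP]

def normalize_spaces(text):
    """Turn non-breaking spaces into ordinary spaces, then trim both ends."""
    return text.replace(NBSP, " ").strip(" ")

clean = "Acme Ltd"
dirty = "Acme Ltd" + NBSP  # looks the same on screen

print(clean == dirty)                    # the match fails
print(find_nbsp(dirty))                  # where the culprit is hiding
print(clean == normalize_spaces(dirty))  # the match succeeds after cleaning
```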
One of the tools that I use the most is the SELECT tool because I normally get large data sets with fields that I won't be using for a specific analysis or with fields that need re-naming. In the same way, sometimes Alteryx will mark a field in a different type than the one I need (e.g. date field as string). That's when the SELECT comes in handy.
However, when dealing with multiple sources, having many SELECT tools on your canvas can often make the workflow look a little "crowded". Not to mention that the extra tools will need explaining later when presenting/sharing your canvas with others. That is why my suggestion is to give the CONNECTION tool "more power" by offering some of the functionality found in the SELECT tool.
For instance, if one of the most used features of the SELECT tool is to choose the fields that will move through the workflow, then maybe we can make that feature available in the CONNECTION tool. Similarly, if one of the most used features (by Alteryx users) is to re-name fields or change the field type, then maybe we can make that available in the CONNECTION tool as well.
In the end, developers can benefit from faster workflow development and end-users will benefit from having cleaner workflows presented to them, which always helps to get the message across.
What do you guys think? Any of you feel the same? Leave your comments below.
Designer should support statistical testing tools that ignore data distribution and support Statistical Learning methods.
Alteryx already supports resampling for predictive modeling with Cross-Validation.
Resampling tools for bootstrap and permutation tests (supporting with or without replacement) should be tools for analysts and data scientists alike that assess random variability in a statistic without needing to worry about the restrictions of the data's distribution, as is the case with many parametric tests, most commonly supported by the t-test Tool in Alteryx. With modern computing power the need for hundred-year-old statistical sampling testing is fading: the power to sample a data set thousands of times to compare results to random chance is much easier today.
The tool's results could include, like R, outputs of not only the results histogram but the associated Q-Q plot that visualizes the distribution of the data for the analyst. This would duplicate the Distribution Analysis tool somewhat, but the Q-Q plot is, to me, a major missing element in the simplest visualization of data. This tool could be very valuable in terms of feeding the A/B Test tools.
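To illustrate what such resampling tools would compute, here is a minimal Python sketch using only the standard library. It is not an Alteryx or R implementation; the function names, defaults, and sample values are my own assumptions. It shows a two-sided permutation test for a difference in means (resampling without replacement) and a percentile bootstrap interval for a mean (resampling with replacement).

```python
import random
from statistics import mean

def permutation_test(a, b, n_resamples=2000, seed=0):
    """Two-sided p-value for the difference in means, by shuffling group labels."""
    rng = random.Random(seed)
    observed = abs(mean(a) - mean(b))
    pooled = list(a) + list(b)
    hits = 0
    for _ in range(n_resamples):
        rng.shuffle(pooled)  # reassign labels at random (no replacement)
        diff = abs(mean(pooled[:len(a)]) - mean(pooled[len(a):]))
        if diff >= observed:
            hits += 1
    # Add one to both counts so the estimate is never exactly zero.
    return (hits + 1) / (n_resamples + 1)

def bootstrap_ci(sample, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile bootstrap confidence interval for the mean."""
    rng = random.Random(seed)
    means = sorted(
        mean(rng.choices(sample, k=len(sample)))  # resample with replacement
        for _ in range(n_resamples)
    )
    lower = means[int((alpha / 2) * n_resamples)]
    upper = means[int((1 - alpha / 2) * n_resamples) - 1]
    return lower, upper

# Made-up measurements for two groups, e.g. an A/B comparison.
group_a = [12.1, 11.8, 12.5, 12.9, 12.3, 12.7]
group_b = [10.2, 10.9, 10.5, 10.1, 10.8, 10.4]

p_value = permutation_test(group_a, group_b)
ci_low, ci_high = bootstrap_ci(group_a)
print(f"permutation p-value: {p_value:.4f}")
print(f"bootstrap 95% CI for mean of A: ({ci_low:.2f}, {ci_high:.2f})")
```

Neither function assumes the data follow any particular distribution, which is the main attraction compared with a parametric t-test.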
A question has been coming up from several users at my workplace about allowing a column description to display in the Visual Query Builder instead of or along with the column name.
The column names in our database are based on an older naming convention, and sometimes the names aren't that easy to understand. We do see that (if a column does have a column description in metadata) it shows when hovering over the particular column; however, the consensus is that we'd like to reverse this and have the column description displayed with the column name shown on hover.
It would be a huge increase to efficiency and workflow development if this could be implemented.
There is a need when visualizing in-Database workflows to be able to visualize sorted data. This sorting could be done 1 of 2 ways: In a browse tool, or as a stand-alone Sort tool. Either would address the need. Without such a tool being present, the only way to sort the data is to "Data Stream Out" and then visualize the data in Alteryx. However, this process violates the premise of the usefulness of the in-DB toolkit, which is to keep your data in-DB and process using the DB engine. Streaming out big data in order to add a sort is not efficient.
Granted, the in-DB processing doesn't care whether data is sorted or not. However, when attempting to find extreme values after an aggregation, or when trying to identify something as simple as whether null values are present in a field, then a sort becomes extremely useful, and a necessary tool for human consumption of data (regardless of the database's processing needs).
Thanks very much for hearing my idea!