Many software & hardware companies take a very quantitative approach to driving their product innovation: they measure how the product performs today against a standard baseline, compare that with how the new version solves the same problem, and show the improvement over time.
- Database vendors have been doing this for years using TPC benchmarks (http://www.tpc.org/), where a FIXED set of tasks is agreed as a benchmark and the database vendors then iterate year over year to improve performance based on these benchmarks
- Graphics card companies or GPU companies have used benchmarks for years (e.g. TimeSpy; Cinebench etc).
How could this translate for Alteryx?
- Every year at Inspire - we hear the stats that say that 90-95% of the time taken is data preparation
- We also know that the reason for buying Alteryx is to reduce the time & skill level required to achieve these outcomes - again, as reinforced by the message that we're driving towards self-service analytics & Citizen-data-analytics.
Wouldn't it be great if Alteryx could say: "In the 2019.3 release - we have taken 10% off the benchmark of common tasks as measured by time taken to complete" - and show a 25% reduction year over year in the time to complete this battery of data preparation tasks?
One proposed method:
Take an agreed benchmark set of tasks / data / problems / outcomes, based on a standard data set - these should include all of the common data preparation problems that people face like date normalization; joining; filtering; table sync (incremental sync as well as dump-and-load); etc.
Measure the time it takes users to complete these data-prep/ data movement/ data cleanup tasks on the benchmark data & problem set using the latest innovations and tools
This time then becomes the measure - if it takes an average user 20 mins to complete these data prep tasks today; and in the 2019.3 release it takes 18 mins, then we've taken 10% off the cost of the largest piece of the data analytics pipeline.
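The arithmetic behind that measure can be sketched in a few lines of Python. This is only an illustration: the function name and the per-user timings are made up, chosen so that the averages match the 20 min / 18 min example above.

```python
from statistics import mean


def percent_reduction(baseline_minutes: float, new_minutes: float) -> float:
    """Return the percentage of time saved relative to the baseline."""
    if baseline_minutes <= 0:
        raise ValueError("baseline_minutes must be positive")
    return 100.0 * (baseline_minutes - new_minutes) / baseline_minutes


# Hypothetical completion times (minutes) for the same benchmark task set,
# measured for three users on the current release and on the next one.
baseline_times = [22.0, 18.0, 20.0]
release_times = [19.0, 17.0, 18.0]

baseline_avg = mean(baseline_times)  # average user today
release_avg = mean(release_times)    # average user on the new release
reduction = percent_reduction(baseline_avg, release_avg)

print(f"Baseline {baseline_avg:.1f} min, new release {release_avg:.1f} min, "
      f"{reduction:.1f}% less time")
```

The same function could be applied per task (date normalization, joining, filtering and so on) to see which parts of the benchmark actually improved.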
What would this give Alteryx?
This could be very simple to administer; and if done well it could give Alteryx:
- A clear and unambiguous marketing message that they are super-focussed on solving for the 90-95% of your time that is NOT being spent on analytics, but rather on data prep
- It would also provide focus to drive the platform in the direction of the biggest pain points - all the teams across the platform can then rally around a really deep focus on the user and accelerating their "time from raw data to analytics".
- A competitive differentiation - invite your competitors to take part too just like TPC.org or any of the other benchmarks
What this is / is NOT:
This is not a run-time measure - i.e. this is not measuring transactions or rows per second
This should be focussed on "Given this problem; and raw data - what is the time it takes you, and the number of clicks and mouse moves etc - to get to the point where you can take raw data, and get it prepped and clean enough to do the analysis".
This should NOT be a test of "Once you've got clean data - how quickly can you do machine learning; or decision trees; or predictive analytics" - as we have said above, that is not the big problem - the big problem is the 90-95% of the time which is spent on data prep / transport / and cleanup.
Loads of ways that this could be administered - starting point is to agree to drive this quantitatively on a fixed benchmark of tasks and data
I would like to see more file types supported for dragging from a folder onto a workflow, specifically .txt and .dat files. This would greatly help my team and me analyze the new and unknown data files that we receive on a daily basis.
A minor, but time saving GUI enhancement would be appreciated. When adding a tool to the canvas, the current behavior is to make visible the tool anchor that was last used on prior tools. That being said, when I look at the results window, I might be adding a "vanilla" configuration tool to the canvas and stare at a BLANK results window. When users are adding tools to the canvas, I suggest that the best practice is to VIEW the incoming data before configuring the tool.
I ALWAYS set the results to view the INCOMING DATA ANCHOR.
Often in larger workflows, I will copy data partway down the stream into a new workflow to troubleshoot a small section, so that I don't have to run the whole workflow over and over again, which can take a while. I'm aware (and thankful) of caching, but sometimes, for example when there are many parallel streams, I'd rather just copy the data from the data preview built into the tool so I don't have to take the time to run the workflow again. I'm also aware I could output a yxdb file and use that, but again that takes longer than I would like.
The issue I run into is if I copy the data and paste in a text input tool, all the field types change to what they would default to. This is fine with new data, but for data that has specific fields throughout the workflow, this can be a hassle. If copying data could also copy the field type and size that would be great.
Similar to the Select tool's Unknown Field Checkbox, I figured it would be useful for the Data Cleansing tool to have this functionality as well in order to avoid a scenario where after a cross-tab you have a new numeric field, one of which has a Null value, so you can't total up multiple fields because the Null value will prevent the addition from happening. If the Unknown Field box were checked off in the Data Cleansing tool then this problem would be avoided.
I have seen the Browse tool offering a basic level of profiling results in the profile table and also a basic data profile tool under Investigation category. But both of them lack the pattern profiling option. I would like to see a pattern profiling option inside Alteryx too, which can show the pattern distribution of column data something like below (This is from SQL Data Profile viewer).
This can be very helpful in checking the data quality, by picking up data anomalies and checking inconsistencies.
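For anyone unfamiliar with pattern profiling, here is a minimal Python sketch of the idea. It is not how any existing profiler is implemented; the mapping (digits to 9, uppercase letters to A, lowercase letters to a, everything else kept as-is) and the sample values are assumptions for illustration.

```python
from collections import Counter


def value_pattern(value: str) -> str:
    """Reduce a value to its shape: digits -> 9, letters -> A/a, rest kept."""
    shape = []
    for ch in value:
        if ch.isdigit():
            shape.append("9")
        elif ch.isalpha():
            shape.append("A" if ch.isupper() else "a")
        else:
            shape.append(ch)
    return "".join(shape)


def pattern_distribution(values):
    """Return (pattern, count, percent) tuples, most frequent pattern first."""
    counts = Counter(value_pattern(v) for v in values)
    total = sum(counts.values())
    return [(p, n, 100.0 * n / total) for p, n in counts.most_common()]


# Hypothetical product codes; two of them break the dominant pattern.
codes = ["AB-1234", "CD-5678", "ef-9012", "XY1234"]

distribution = pattern_distribution(codes)
for pattern, count, percent in distribution:
    print(f"{pattern:<10} {count:>3} {percent:5.1f}%")
```

Rare patterns at the bottom of the list are the candidates for data-quality checks.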
It would often be very useful to have the ability to search for a field in a Browse tool.
At the moment I don't think there's an easy way to manually trace data through a workflow.
For example, you have created a workflow with various joins, filters, etc. and notice that the final output is missing data for "ABC limited". The only way to find the step at which ABC limited dropped out of the workflow is to add 10 filter tools branching out from before and after each step in the workflow's logic, then re-run the workflow (which might take 5-10 minutes) to see where "ABC limited" has gone. You fix the problem ("ABC ltd" didn't join to "ABC Limited"), but now you also want to check for XYZ limited, so you have to manually edit all 10 filter tools. It seems you have fixed the problem, but now your workflow is a mess of 10 filter tools.
Alternatively you could copy and paste the data from every browse tool into an excel workbook and use their search function instead, but that's obviously a cumbersome and unhelpful process, particularly as the excel sheet will have to be remade with every run of the workflow.
You could also use sort tools throughout before a browse tool, but that is still slow and doesn't help with cases where "ABC Ltd" is matching to "The ABC Co ltd"
Perhaps it would be much easier to just have a small search box in every browse tool?
Or is there a feature that I'm not aware of that makes this process of quality checking your workflow easier already?
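As a rough illustration of what such a search would do, here is a Python sketch that looks for a term in snapshots of the data taken at each step and reports the first step where it disappears. The stage names, records and function names are all made up; this is not an existing Alteryx feature.

```python
def count_matches(rows, term):
    """Count rows where any field contains the term (case-insensitive)."""
    needle = term.casefold()
    return sum(
        1 for row in rows
        if any(needle in str(value).casefold() for value in row.values())
    )


def first_stage_without(stages, term):
    """Return the name of the first stage with no match, or None if all match.

    `stages` maps stage name -> list of row dicts, in workflow order.
    """
    for name, rows in stages.items():
        if count_matches(rows, term) == 0:
            return name
    return None


# Hypothetical snapshots of the data after each step of a workflow.
stages = {
    "input": [{"Company": "ABC Limited"}, {"Company": "XYZ Limited"}],
    "after_join": [{"Company": "XYZ Limited"}],
    "output": [{"Company": "XYZ Limited"}],
}

dropped_at = first_stage_without(stages, "abc")
print("ABC first missing at:", dropped_at)
print("XYZ first missing at:", first_stage_without(stages, "xyz"))
```

Checking a second company is then just another call with a different term, rather than editing every filter tool.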
There is a need when visualizing in-Database workflows to be able to visualize sorted data. This sorting could be done 1 of 2 ways: In a browse tool, or as a stand-alone Sort tool. Either would address the need. Without such a tool being present, the only way to sort the data is to "Data Stream Out" and then visualize the data in Alteryx. However, this process violates the premise of the usefulness of the in-DB toolkit, which is to keep your data in-DB and process using the DB engine. Streaming out big data in order to add a sort is not efficient.
Granted, the in-DB processing doesn't care whether data is sorted or not. However, when attempting to find extreme values after an aggregation, or when trying to identify something as simple as whether null values are present in a field, then a sort becomes extremely useful, and a necessary tool for human consumption of data (regardless of the database's processing needs).
Right now - if a tool generates an error - there is nothing productive that you can do with the error rows, these are just sent to the error log and depending on your settings the entire canvas will fail.
Could we change this in the Designer to work more like SSIS - where almost every tool has an error output, so that you can send the good rows one way, and the error rows the other way, and then continue processing? The error rows can be sent to an error table or workflow or data-quality service; and the good rows can be sent onwards. Because you have access to the error rows, you can also do run stats of "successful rows vs. unsuccessful"
This would make a big difference in the velocity of developing a canvas or prepping data.
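The good-rows / error-rows split described above could look roughly like this in Python. This is only a sketch of the pattern, not how Designer or SSIS implement it; the function names, the sample rows and the choice of which exceptions count as row-level errors are assumptions.

```python
def split_rows(rows, transform):
    """Apply transform to each row; return (good_rows, error_rows).

    A failing row does not stop processing. It is routed to the error
    output together with a description of what went wrong.
    """
    good, errors = [], []
    for row in rows:
        try:
            good.append(transform(row))
        except (ValueError, KeyError, TypeError) as exc:
            errors.append({**row, "error": f"{type(exc).__name__}: {exc}"})
    return good, errors


def parse_amount(row):
    """Hypothetical transform: convert the 'amount' field to a float."""
    return {**row, "amount": float(row["amount"])}


rows = [
    {"id": 1, "amount": "10.5"},
    {"id": 2, "amount": "n/a"},   # cannot be converted
    {"id": 3, "amount": "7"},
]

good, errors = split_rows(rows, parse_amount)

# Run stats: successful rows vs. unsuccessful rows.
print(f"{len(good)} good, {len(errors)} errors")
for bad in errors:
    print("error row:", bad["id"], bad["error"])
```

The error rows keep their original values, so they can be written to an error table or handed to another process for follow-up.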
This feature isn't a must - but would definitely be a nice to have.
Excel has a tab with key figures like average, count and sum.
It would be a really good idea to do something similar within Alteryx, just to have a quick glance at key figures/functions (example attached - apologies for the bad Paint job, but it would definitely look good with the Alteryx colour scheme).
It is disorienting when I create string fields explicitly coded to a specific length, then view my results window and the values do not line up across rows. If a fixed-width font like Courier were added as an option, this could be avoided.
Would be nice to have the option of disabling the append of the "action" to the variable in the summarize tool. Sometimes it's useful to leave the variable name as is when making tweaks to your module.
It would be great if there was an option to compute 'median' on numerical data column in 'cross-tab' tool. We trust 'median' a lot more than 'average' in many different computations. I would stretch my suggestion far enough to propose adding quantile computations as well...
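To show why the median matters here, a small Python sketch of a cross-tab with a pluggable aggregation. This is not the Cross Tab tool's implementation; the function, field names and data are hypothetical.

```python
from collections import defaultdict
from statistics import mean, median


def crosstab(rows, row_key, col_key, value_key, agg=median):
    """Pivot rows into {row_value: {col_value: agg(values)}}."""
    cells = defaultdict(list)
    for r in rows:
        cells[(r[row_key], r[col_key])].append(r[value_key])
    table = defaultdict(dict)
    for (row_value, col_value), values in cells.items():
        table[row_value][col_value] = agg(values)
    return {row_value: dict(cols) for row_value, cols in table.items()}


# Hypothetical sales data with one outlier in East / Q1.
sales = [
    {"Region": "East", "Quarter": "Q1", "Amount": 10},
    {"Region": "East", "Quarter": "Q1", "Amount": 12},
    {"Region": "East", "Quarter": "Q1", "Amount": 100},
    {"Region": "West", "Quarter": "Q1", "Amount": 5},
    {"Region": "West", "Quarter": "Q1", "Amount": 7},
]

by_median = crosstab(sales, "Region", "Quarter", "Amount", agg=median)
by_mean = crosstab(sales, "Region", "Quarter", "Amount", agg=mean)

print("median:", by_median)
print("mean:  ", by_mean)
```

The single outlier drags the East average well above the typical value, while the median stays put. Because `agg` is just a callable, a quantile function (for example one built on `statistics.quantiles`) could be passed in the same way.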
In the histogram tool, I would like the ability to specify the bins, not just the number of bins, but the values of the bins. That would be especially helpful when comparing different data sets when I want to see an apples to apples comparison across two different histograms.
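A minimal Python sketch of what user-specified bins would mean: both data sets are counted against the same explicit bin edges, so the histograms are directly comparable. The function, the edge convention (left-closed bins, with the last bin also including its right edge, and out-of-range values ignored) and the data are assumptions for illustration.

```python
from bisect import bisect_right


def histogram(values, edges):
    """Count values into bins [edges[i], edges[i+1]).

    The last bin also includes its right edge. Values outside the
    overall range are ignored.
    """
    if len(edges) < 2 or any(a >= b for a, b in zip(edges, edges[1:])):
        raise ValueError("edges must be strictly increasing, at least two")
    counts = [0] * (len(edges) - 1)
    for v in values:
        if v < edges[0] or v > edges[-1]:
            continue
        i = bisect_right(edges, v) - 1
        if i == len(counts):  # v sits exactly on the last edge
            i -= 1
        counts[i] += 1
    return counts


# The same hypothetical edges are reused for both data sets.
edges = [0, 10, 20, 30]
dataset_a = [1, 5, 12, 29, 30]
dataset_b = [10, 11, 25, 40]

counts_a = histogram(dataset_a, edges)
counts_b = histogram(dataset_b, edges)

for low, high, a, b in zip(edges, edges[1:], counts_a, counts_b):
    print(f"{low:>3}-{high:<3} A={a} B={b}")
```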
For those with large web and streaming-media server logs, the ability to geocode IP addresses would be an excellent feature, similar to Alteryx's ability to geocode street addresses. Several IP geocoding services exist, with different levels of accuracy and cost. Ideally, the user should be able to choose their own service if they have one, in addition to a default service built into Alteryx.