(I apologize if this question has already been asked, but if it was, I was not able to find it. I am still pretty new to Alteryx.)
My supervisor and I have some workflows that query our transaction lines table, which numbers in the tens of millions of records, so we are trying to optimize the performance of those queries.
If we wanted to query all transactions in 2018, it seems like we could do it in one of three ways:
Use (a) the "Input Data" tool to select the table, (b) the "Select" tool to select the fields we need, and (c) the "Filter" tool to filter by transaction date.
Use (a) the "Input Data" tool to select both the table and fields we need by using either the "Visual Query Builder" or "SQL Editor" and (b) the "Filter" tool to filter by transaction date.
Use just the "SQL Editor" in the "Input Data" tool to do all selecting and filtering in the query itself, using the SELECT field list and a WHERE clause.
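Outside Alteryx, the difference between option #1 and option #3 can be sketched with Python's built-in sqlite3 module. The table and column names here (transaction_lines, txn_date) are invented for illustration, not taken from the actual database:

```python
import sqlite3

# Build a tiny stand-in for the transaction lines table
# (table and column names are hypothetical).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE transaction_lines (id INTEGER, txn_date TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO transaction_lines VALUES (?, ?, ?)",
    [
        (1, "2017-11-30", 10.0),
        (2, "2018-03-15", 20.0),
        (3, "2018-12-01", 30.0),
        (4, "2019-01-02", 40.0),
    ],
)

# Option #1 style: pull every record, then filter client-side
# (what the Filter tool does after the Input Data tool).
all_rows = conn.execute("SELECT id, txn_date, amount FROM transaction_lines").fetchall()
filtered_client = [r for r in all_rows if r[1].startswith("2018")]

# Option #3 style: push the filter into the query with a WHERE clause,
# so only matching rows ever leave the database.
filtered_db = conn.execute(
    "SELECT id, txn_date, amount FROM transaction_lines "
    "WHERE txn_date >= '2018-01-01' AND txn_date < '2019-01-01'"
).fetchall()

print(filtered_client == filtered_db)   # True: same result either way
print(len(all_rows), len(filtered_db))  # 4 rows moved vs 2
```

Both approaches return the same rows; the difference is how many records have to travel from the database into the workflow before the filter is applied.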
(Typically, we would also join the transaction lines table with other tables, but for this question, I am really just interested in the input process.)
Of the three options, #3 seems like it might be the fastest computationally, but it is also the slowest and most tedious to set up, even for a user who is relatively skilled at writing SELECT-only SQL queries. #1 is clearly the quickest and easiest from a user's standpoint.
I guess my questions are...
Is #3 really the fastest? The only thing that makes me think it might not be is that when I run a workflow, I often see several tools "activate" at once, which makes me think that upstream and downstream tools interact and do not have a strict finish-to-start relationship.
If so, how much faster might #3 be on a data set that includes around 20M records and 100 or so fields? If it is just a small time difference, it is probably not worth the time it takes to set up.
Does Alteryx have a way of profiling the performance of different options (e.g., Tableau's performance recorder)?
First off, there is performance profiling available. In the workflow's Configuration window, there's a "Runtime" tab; at the bottom of that tab is a checkbox to enable Performance Profiling. Here's an article with more information.
Now to the original question: #3 should be the fastest option. I try to imagine data moving through the workflow: each time it passes through a tool, it has to be processed and moved in and out of memory, so we want to minimize that. In option #1, the data has to move from the table to the Select tool, then to the Filter tool, where it is finally reduced. Option #3 involves the least data movement to be processed and held in memory to achieve the same goal.
- It's my understanding that the green borders/"activation" are a pretty rough indication of what's going on in the engine. They're there for user feedback, but probably shouldn't be relied on. EDIT: If you ever wonder why some non-connected tools start before others, it comes down to the ToolID of those tools.
- When weighing "set-up time" against "letting it all process", consider whether the field you need to filter on is indexed in the database. If it is, use a WHERE clause in the query and take advantage of that index. If it isn't, pushing the filter into the query may not be a big performance gain over the Filter tool, unless hardware differences (database server host vs. Alteryx host) are the limiting factor.
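To see why the index matters, here's a small sqlite3 sketch (table, column, and index names are invented) that asks the query planner, via EXPLAIN QUERY PLAN, how it would execute the date-filtered query when the date column is indexed:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE transaction_lines (id INTEGER, txn_date TEXT, amount REAL)")
# Index the filter column, as the real transactions table presumably would be.
conn.execute("CREATE INDEX idx_txn_date ON transaction_lines (txn_date)")

# Ask the planner how it would run the filtered query without executing it.
plan = conn.execute(
    "EXPLAIN QUERY PLAN "
    "SELECT id, amount FROM transaction_lines "
    "WHERE txn_date >= '2018-01-01' AND txn_date < '2019-01-01'"
).fetchall()

# The last column of each plan row is a human-readable description;
# with the index in place it reports a SEARCH using idx_txn_date
# rather than a full-table SCAN.
detail = " ".join(row[-1] for row in plan)
print(detail)
```

Without the index, the database has to scan every row to evaluate the WHERE clause, which is roughly the same work the Filter tool would do after pulling everything across.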