There's a common situation in Alteryx where a key is treated as a string, and that key is then used for joins, sorts, and filters. These keys also need to be carried through the flow by every tool, which creates a data-transport cost.
However - string keys are very very expensive - and the way that SQL engines optimize this is to create an index of the distinct values and replace the strings with integers under the covers, only converting back at the point of creating an output.
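As a toy illustration of that trick (hypothetical data, nothing Alteryx-specific): encode each distinct string key as an integer once, run the join on integers, and decode only at output time:

```python
# Dictionary-encoded keys in miniature: join on cheap integer
# surrogates, translate back to strings only when producing output.

left = [("GB-LON-0001", 250), ("US-NYC-0042", 120), ("GB-LON-0001", 75)]
right = [("GB-LON-0001", "London"), ("US-NYC-0042", "New York")]

# Build the dictionary once, up front.
keys = sorted({k for k, _ in left} | {k for k, _ in right})
encode = {k: i for i, k in enumerate(keys)}
decode = {i: k for k, i in encode.items()}

# Replace string keys with integer surrogates for the whole pipeline.
left_enc = [(encode[k], v) for k, v in left]
right_enc = {encode[k]: city for k, city in right}

# The join now compares integers, not strings.
joined = [(decode[kid], amount, right_enc[kid])
          for kid, amount in left_enc if kid in right_enc]
print(joined)  # [('GB-LON-0001', 250, 'London'), ...]
```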
Would this be a feasible way to improve the performance of Alteryx, without changing the functionality?
I'll comment in WhatsApp
Thanks @SeanAdams
I wanted you to know that this has been added to our internal Intake for product ideas.
Hi @SeanAdams
I'm curious about your impression that strings are "very very expensive". In AMP, strings are stored very compactly. Strings up to 127 UTF-8 bytes have only one byte of overhead. The cheapest string type to use is V_WString with a large size limit. When you use a String type, we check that you don't have any wide characters in the string; with a WString type we don't need to do that check. When you use a small size limit, we have to check that the data fits into the limit; when you use a large limit, we don't need to do that check. All strings are stored as variable-length (with 1 byte of overhead), so there is no wasted memory in using a large size limit. It is not like e1, where fixed-size strings avoided putting the string size into the record at the expense of always allocating the full size limit.
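To make the one-byte-overhead point concrete, here's a rough sketch of a length-prefixed variable-width layout - an illustration of the general idea only, not AMP's actual record format:

```python
# Toy length-prefixed string storage: <1-byte length><UTF-8 bytes>.
# One byte of overhead per string, up to 127 UTF-8 bytes.

def pack_strings(values: list[str]) -> bytes:
    out = bytearray()
    for v in values:
        data = v.encode("utf-8")
        assert len(data) <= 127, "longer strings would need a wider length field"
        out.append(len(data))   # the single byte of per-string overhead
        out += data
    return bytes(out)

def unpack_strings(buf: bytes) -> list[str]:
    values, i = [], 0
    while i < len(buf):
        n = buf[i]
        values.append(buf[i + 1 : i + 1 + n].decode("utf-8"))
        i += 1 + n
    return values

packed = pack_strings(["yes", "no", "unknown"])
assert unpack_strings(packed) == ["yes", "no", "unknown"]
```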
I'd be interested in your measurements that explain what's expensive about strings. From knowing the internals, I expect strings to be pretty cheap. (Of course, if your string is 30 KB, that'll be expensive.)
Chris
Hey @ChrisK - thank you for the question.
Yes - there's a fairly broad and rich body of papers on this which have been published over the years - for example, the Dremel paper that led to the creation of the Parquet data format, and the papers from Microsoft's research team behind the VertiPaq engine. The summary is that many analytical tabular datasets have redundancy - especially in an ordinal-type column, since the same value is repeated across many different rows.
The way that vector-based solutions tackle this problem is to look at the ordinal data elements and do on-the-fly compression using things like dictionary encoding (replacing repeated strings with small integer IDs), run-length encoding (RLE), and value encoding - see the sketch below.
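As a sketch of the simplest of these, run-length encoding collapses repeated consecutive values into (value, count) pairs and is fully reversible (toy example, not Alteryx code):

```python
from itertools import groupby

# Minimal run-length encoder: one of the tricks a column store can
# apply when a column repeats the same value across consecutive rows.

def rle_encode(column):
    return [(value, sum(1 for _ in run)) for value, run in groupby(column)]

def rle_decode(runs):
    return [value for value, count in runs for _ in range(count)]

column = ["A"] * 4 + ["B"] * 2 + ["A"] * 3
runs = rle_encode(column)          # [('A', 4), ('B', 2), ('A', 3)]
assert rle_decode(runs) == column  # lossless: fully reconstructable
```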
The great thing about this kind of vector-based thinking (or column-store thinking) is that you lose no functionality, since you can always reconstruct the original values - yet during processing you see dramatic speed increases. For example, changing a table in MSSQL from row-store to column-store (using the VertiPaq engine) has repeatedly given us a 100x (two full orders of magnitude) speed-up without changing anything else in the query or workload.
So storing data and moving it around as vectors, with the compression methods inherent to column-store, is well known to be more efficient and dramatically smaller in size (as evidenced by the column-store and VertiPaq research mentioned above) for a large family of analytical workloads. The question is whether this would be a useful addition to AMP to help process large datasets, while still allowing memory, cache usage, paging etc. to scale in a way that is dramatically sublinear. It may be that AMP already uses the techniques outlined in the Dremel paper (or similar) to drive efficiency in the end-to-end pipeline - in which case that's outstanding news.
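To make "dramatically smaller" concrete, here's a back-of-the-envelope sketch with assumed toy data (a million rows, 100 distinct keys) comparing raw per-row string storage against a dictionary plus integer codes:

```python
import sys
from array import array

# A million rows of a low-cardinality string key (assumed toy data).
rows = ["CUSTOMER_%03d" % (i % 100) for i in range(1_000_000)]

# Raw: every row pays the full cost of its own string.
raw_bytes = sum(sys.getsizeof(s) for s in rows)

# Dictionary-encoded: 100 unique strings plus a 4-byte code per row.
dictionary = sorted(set(rows))
code_of = {s: i for i, s in enumerate(dictionary)}
codes = array("i", (code_of[s] for s in rows))
encoded_bytes = sum(sys.getsizeof(s) for s in dictionary) + codes.itemsize * len(codes)

print(f"raw:     {raw_bytes / 1e6:.1f} MB")
print(f"encoded: {encoded_bytes / 1e6:.1f} MB")
```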
Thanks for the reply @ChrisK - let me know if this makes sense, or if it's worth a voice-to-voice call to talk this through more deeply.
Again, this is not about making any one cell small (for that you can easily use RLE, but that's not the big win) - the big win comes from looking at the entire dataset and looking for redundancies in the data. Here are some links that talk through some of this:
Columnstore indexes: Overview - SQL Server | Microsoft Learn
Dremel: Interactive Analysis of Web-Scale Datasets - 36632.pdf (googleusercontent.com)
The VertiPaq Engine in DAX | Microsoft Press Store
Does this explain why Alteryx can't perform case-insensitive joins? Just wondering 💭