Engine improvement idea: symbolic referencing / indexing

SeanAdams
17 - Castor

There's a common situation in Alteryx where a key is treated as a string - this key is used for joins, for sorts, and for filters. These keys also need to be carried through the flow by every tool, which creates a data-transport cost.

 

However - string keys are very very expensive - and the way that SQL engines optimize this is to create an index of these keys and replace the strings with integers under the covers, converting back only at the point of creating an output.
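To make the idea concrete, here's a rough Python sketch of the kind of substitution I mean (purely illustrative - not Alteryx internals): string keys are swapped for integer surrogates up front, the join runs on the integers, and the strings only come back at output time.

    # Illustrative sketch only (not Alteryx internals): replace string keys with
    # integer surrogates before joining, then map back at output time.
    left  = [("CUST-000184723", 120.50), ("CUST-000992811", 33.10)]
    right = [("CUST-000184723", "Gold"), ("CUST-000992811", "Silver")]

    # Build the dictionary once, up front.
    key_to_id = {}
    def encode(key):
        # Assign the next integer the first time a key is seen.
        return key_to_id.setdefault(key, len(key_to_id))

    left_enc  = [(encode(k), v) for k, v in left]
    right_enc = [(encode(k), v) for k, v in right]

    # Join on the cheap integer surrogate instead of the string.
    right_index = {k: v for k, v in right_enc}
    joined = [(k, amount, right_index[k]) for k, amount in left_enc if k in right_index]

    # Convert back to the original strings only when producing the output.
    id_to_key = {i: k for k, i in key_to_id.items()}
    print([(id_to_key[k], amount, tier) for k, amount, tier in joined])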

 

Would this be a feasible way to improve the performance of Alteryx, without changing the functionality?

6 REPLIES
MarqueeCrew
20 - Arcturus

@SeanAdams ,

 

I'll comment in WhatsApp

TonyaS
Alteryx

Thanks @SeanAdams 

I wanted you to know that this has been added to our internal Intake for product ideas. 

Tonya Smith
Sr. Technical Product Manager, cloud App Builder
ChrisK
Alteryx

Hi @SeanAdams 

 

I'm curious about your impression that strings are "very very expensive". In AMP, strings are stored very compactly. Strings up to 127 UTF-8 bytes have only one byte of overhead. The cheapest string type to use is V_WString with a large size limit. When you use a String type we check that you don't have any wide characters in the string; with a WString type we don't need to do that check. When you use a small size limit we have to check that the data fits into the limit; when you use a large limit we don't need to do that check. All strings are stored as variable-length (with 1 byte of overhead), so there is no wasted memory in using a large size limit. It is not like e1, where fixed-size strings avoided putting the string size into the record at the expense of always allocating the full size limit.
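For anyone curious what a length-prefixed layout like that looks like in general, here is a tiny sketch - a generic illustration only, not the actual AMP record format: lengths of 127 bytes or less fit in a single prefix byte, and longer strings spill into extra prefix bytes.

    # Generic length-prefixed string packing, purely to illustrate the
    # "one byte of overhead" idea; this is NOT the actual AMP record format.
    def pack_strings(values):
        buf = bytearray()
        for s in values:
            data = s.encode("utf-8")
            n = len(data)
            if n <= 127:
                buf.append(n)              # single-byte length prefix
            else:
                # longer strings: varint-style prefix, low 7 bits per byte,
                # high bit set on every byte except the last
                while n > 127:
                    buf.append((n & 0x7F) | 0x80)
                    n >>= 7
                buf.append(n)
            buf.extend(data)
        return bytes(buf)

    packed = pack_strings(["Satisfied", "Unsatisfied", "Satisfied"])
    # 9 + 11 + 9 bytes of text plus 1 prefix byte per value = 32 bytes
    print(len(packed))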

 

I'd be interested in your measurements that explain what's expensive about strings. From knowing the internals, I expect strings to be pretty cheap. (Of course, if your string is 30 KB, that'll be expensive.)

 

Chris

SeanAdams
17 - Castor

Hey @ChrisK - thank you for the question.

 

Yes - there's a fairly broad and rich base of papers on this which have been published over the years - for example, the Dremel paper that led to the creation of the Parquet data format, the papers from Microsoft's research team on the VertiPaq engine, etc. The summary is that many analytical tabular data-sets have redundancy - especially in an ordinal (low-cardinality) column, since the same value is repeated many times across different rows.

 

The way that vector-based solutions tackle this problem is to look at the ordinal data elements and do on-the-fly compression using techniques like the following (a small sketch of both ideas follows this list):

  • Dictionary lookup: if there are only 2 values in a column ("Satisfied" and "Unsatisfied"), then storing the full text in every row is not necessary; instead you can create a dictionary for this column that uses an 8-bit int (or smaller) as an ID, and then in the main data set you store only the key (e.g. 0 for Satisfied, 1 for Unsatisfied).
  • Deltas / offsets: a date column can be stored and processed more efficiently as a starting date (a long int) plus variable-length offsets from that date (variable-length ints).
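Here's the small sketch mentioned above - a minimal Python illustration of both ideas (dictionary encoding and delta/offset encoding). Real column-store engines like VertiPaq or Parquet implement these far more efficiently, but the principle is the same:

    # Rough, purely illustrative sketch of the two ideas above.
    from datetime import date, timedelta

    # 1) Dictionary encoding: a low-cardinality string column becomes small ints.
    column = ["Satisfied", "Unsatisfied", "Satisfied", "Satisfied", "Unsatisfied"]
    dictionary = {}
    encoded = [dictionary.setdefault(v, len(dictionary)) for v in column]
    print(dictionary)   # {'Satisfied': 0, 'Unsatisfied': 1}
    print(encoded)      # [0, 1, 0, 0, 1] -- each value fits in a single byte

    # 2) Delta/offset encoding: store a base date plus small offsets per row.
    dates = [date(2024, 1, 1), date(2024, 1, 3), date(2024, 1, 3), date(2024, 2, 1)]
    base = min(dates)
    offsets = [(d - base).days for d in dates]
    print(base, offsets)  # 2024-01-01 [0, 2, 2, 31]

    # Reconstruction is lossless, so no functionality is given up.
    decoded = [base + timedelta(days=o) for o in offsets]
    assert decoded == dates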

The great thing about this kind of vector-based thinking (or column-store thinking) is that you lose no functionality, since you can always reconstruct the original values - however, during processing you see dramatic speed increases. For example, changing a table in MSSQL from row-store to column-store (using the VertiPaq engine) has repeatedly given us a 100x (two full orders of magnitude) speed-up without changing anything else in the query or workload.

 

So storing data and moving it around as vectors, with these kinds of compression methods inherent to column-store, is well known to be more efficient and dramatically smaller in size (as evidenced by all the research on column-store and VertiPaq mentioned above) for a large family of analytical workloads. The question is whether this would be a useful addition to AMP to help process large datasets, while still allowing memory, cache usage, paging etc. to scale in a way that is dramatically sublinear. It may be that AMP already uses the techniques outlined in the Dremel paper (or similar) to drive efficiency on the end-to-end pipeline - in which case that's outstanding news.

 

Thanks for the reply @ChrisK  - let me know if this makes sense, or if it's worth a voice-to-voice call to talk this through more deeply.

 

 

SeanAdams
17 - Castor

Here are some links that talk through some of this. Again, this is not about making any one cell small (for that you could easily use RLE, but that's not the big win) - the big win comes from looking at the entire dataset and finding redundancies in the data.
Columnstore indexes: Overview - SQL Server | Microsoft Learn
36632.pdf (googleusercontent.com)
The VertiPaq Engine in DAX | Microsoft Press Store

 

 

MarqueeCrew
20 - Arcturus

Does this explain why Alteryx can't perform case insensitive joins?  Just wondering 💭 
