Advent of Code is back! Unwrap daily challenges to sharpen your Alteryx skills and earn badges along the way! Learn more now.
Free Trial

Alteryx Designer Desktop Discussions

Find answers, ask questions, and share expertise about Alteryx Designer Desktop and Intelligence Suite.
SOLVED

Engine improvement idea: symbolic referencing / indexing

SeanAdams
17 - Castor
17 - Castor

There's a common situation in Alteryx - where a key is used which is treated as a string - this key is used for joins, for sorts, filters.    These keys also need to be carried through the flow for every tool, which then creates a data transport cost

 

However - string keys are very very expensive - and the way that SQL engines optimize this is that they create an index of these and replace the strings with integers under the covers, and only convert back at the point of creating an output.

 

Would this be a feasible way to improve the performance of Alteryx, without changing the functionality?

6 REPLIES 6
MarqueeCrew
20 - Arcturus
20 - Arcturus

@SeanAdams ,

 

I'll comment in WhatsApp

Alteryx ACE & Top Community Contributor

Chaos reigns within. Repent, reflect and restart. Order shall return.
Please Subscribe to my youTube channel.
TonyaS
Alteryx
Alteryx

Thanks @SeanAdams 

I wanted you to know that this has been added to our internal Intake for product ideas. 

Tonya Smith
Sr. Product Manager, cloud App Builder
ChrisK
Alteryx
Alteryx

Hi @SeanAdams 

 

I'm curious about your impression that strings are "very very expensive". In AMP strings are stored very compactly. Strings up to 127 UTF-8 bytes have only one byte of overhead. The cheapest string type to use is V_WString with a large size limit. When you use a String type we check that you don't have any wide characters in the string; with a WString type we don't need to do that check. When you use a small size limit we have to check that the data fits into the limit; when you use a large limit we don't need to do that check. All strings are stored as variable (with 1 byte overhead), so there is no wasted memory for using a large size limit. It is not like e1 where fixed sized strings avoided putting the string size into the record at the expense of always allocating the size limit.

 

I'd be interested in your measurements that explain what's expensive about strings. From knowing the internals, I expect strings to be pretty cheap. (Of course, if your string is 30 KB, that'll be expensive.)

 

Chris

SeanAdams
17 - Castor
17 - Castor

Hey @ChrisK - thank you for the question.

 

Yes - there's a fairly broad and rich base of papers on this which have been published over the years - for example the Dremel Paper that led to the creation of the Parquet data format; the papers from Microsoft's research team as part of the Vertipak analysis; etc.    The summary is that many analytical tabular data-sets have redundancy - especially for an ordinal type as a column, since the value is repeated many times on different rows.

 

The way that vector-based solutions tackle this problem is to look at the ordinal data elements and do on-the-fly compression using things like

  • Dictionary lookup.    If there are only 2 values in a column ("Satisfied" and "Unsatisfied") - then storing the full text in every row is not necessary, instead you can create a dictionary for his column which uses an 8bit int (or smaller) as an ID - and then in the main data set you store the value of the key (e.g. 0 for Satisfied; 1 for Unsatisfied)
  • Deltas / offsets: A date column can be more efficiently stored and processed as a starting date (a long int for the date) and variable length offsets from this date (variable length ints).

The great thing about this kind of vector-based thinking (or column-store thinking) is that you lose no functionality since you can always reconstruct - however during processing you see dramatic speed-increases.   For example - changing a table in MSSQL from Row-store to Column-store (using the Vertipaq engine) has repeatedly given us a 100x (two full orders of magnitude) speed up without changing anything else on the query or workload.

 

So storing data & moving it around as vectors with these kind of compression methods inherent to column-store is well known to be more efficient and dramatically smaller in size (as evidenced by all the research on column-store & vertex SQL mentioned above) for a large family of analytical workloads - the question is whether this would be a useful addition to AMP to help process large datasets; while still allowing memory, cache usage, paging etc to scale in a way that is dramatically sublinear.    It may be that AMP already uses the techniques outlined in the Dremel paper (or similar) to drive efficiency on the end-to-end pipeline - in which case that's outstanding news.

 

Thanks for the reply @ChrisK  - let me know if this makes sense, or if it's worth a voice-to-voice call to talk this through more deeply.

 

 

SeanAdams
17 - Castor
17 - Castor

Here's some links to talk through some of this:    Again, this is not about making any once cell small (for that you can easily use RLE, but that's not the big win) - the big win comes from looking at the entire dataset and looking for redundancies in data.
Columnstore indexes: Overview - SQL Server | Microsoft Learn
36632.pdf (googleusercontent.com)
The VertiPaq Engine in DAX | Microsoft Press Store

 

 

MarqueeCrew
20 - Arcturus
20 - Arcturus

Does this explain why Alteryx can't perform case insensitive joins?  Just wondering 💭 

Alteryx ACE & Top Community Contributor

Chaos reigns within. Repent, reflect and restart. Order shall return.
Please Subscribe to my youTube channel.
Labels
Top Solution Authors