Looks like the output file size has some capacity to reflect the sort order through indexing techniques.
When I run the same scenario with flat-file outputs (CSV), the input and both outputs are exactly the same size.
I believe the Alteryx database format is a heavily indexed, file-based data store, and thus optimising for sort order may affect the output size.
I presume only the Alteryx engineers can answer this precisely, but my theory is that your setup most likely affects not only the size but also the retrieval time and operations on this data. I would expect these to be more optimised given a set of keys that aligns with how you analyse your data (filtering, summarizing, etc.).
Yes, it is very interesting that your outputs may vary by so much!
CSV is a very basic format with no place for any metadata; even field types are not stored. I always like to use Notepad++ to look inside, as Excel may interpret things for you (without asking), like long number values. Basically, CSV is WYSIWYG (what you see is what you get).
It feels like the Alteryx database format has some invisible magic behind it and is heavily affected by sorting specifically, which to my thinking points to indexing. I just wonder what else might be affecting this format?
Just to update, for any passing readers. I reached out to Alteryx support directly on this, and the Customer Support Engineer contacted development.
I would summarise the information we got back as, broadly, "yes, this is the case, but we can't tell you why." Different patterns in the data mean different compression: the more data that matches a pattern, the better the compression. But development couldn't provide much information on what those patterns actually are, despite having specifically designed the software to identify them. So we couldn't get to a best-practice guide for sorting a dataset for maximum compression.
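Alteryx hasn't published the internals of the .yxdb compressor, but the general effect, that sorting groups identical values into runs a pattern-based compressor can exploit, is easy to reproduce with any generic compressor. This sketch uses Python's zlib purely as an illustration; the field names and sizes are made up, and the actual .yxdb algorithm may behave differently:

```python
import random
import zlib

# Hypothetical categorical column: 100,000 rows drawn from four values,
# first in random order, then sorted.
random.seed(42)
values = [random.choice(["alpha", "beta", "gamma", "delta"]) for _ in range(100_000)]

unsorted_bytes = "\n".join(values).encode()
sorted_bytes = "\n".join(sorted(values)).encode()

# Sorting creates long runs of identical lines, which compress far better.
unsorted_size = len(zlib.compress(unsorted_bytes))
sorted_size = len(zlib.compress(sorted_bytes))

print(f"unsorted: {unsorted_size} bytes, sorted: {sorted_size} bytes")
```

On this synthetic data the sorted version compresses to a small fraction of the unsorted one, which is consistent with the size differences observed in the .yxdb outputs, even if the exact mechanism inside Alteryx is undocumented.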
The exact information we got back was:
"The compression algorithm will scan the data for patterns which it will compress on; however, certain data types will perform differently. For example, string fields generally have patterns and will compress 'better', whereas binary fields will have less pattern and therefore are not expected to compress as well. Additionally, some languages have more patterns that can impact compression (i.e. English will compress better than Arabic depending on the compression method)."
"The patterns in the data will change with each dataset, so it is not possible for even a workflow that compressed, say 50%, to compress that same amount (or even close to the same amount) even with the same workflow configuration on the next run."
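The data-type point in support's reply is also easy to demonstrate with a generic compressor. The sketch below, again using Python's zlib as a stand-in (the actual .yxdb compressor is undocumented), compares repetitive English-like text against random bytes of the same length:

```python
import os
import zlib

# English-like text repeats the same phrase, so it is full of patterns;
# random bytes have essentially none and are effectively incompressible.
text = ("the quick brown fox jumps over the lazy dog " * 2000).encode()
binary = os.urandom(len(text))

text_compressed = len(zlib.compress(text))
binary_compressed = len(zlib.compress(binary))

print(f"text:   {len(text)} -> {text_compressed} bytes")
print(f"binary: {len(binary)} -> {binary_compressed} bytes")
```

The text shrinks dramatically while the random bytes do not shrink at all, matching the string-versus-binary behaviour support described. It also illustrates the second quote: the compression ratio depends entirely on the patterns in that particular dataset, not on the workflow configuration.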