community
cancel
Showing results for 
Search instead for 
Did you mean: 

Alteryx designer Discussions

Find answers, ask questions, and share expertise about Alteryx Designer.

Sorting dataset affects data size

Highlighted

Hi all

 

I have noticed a few times that sorting a dataset prior to writing it out affects the file size, relative to not performing the sort.

 

I did a simple workflow and in this case the sorted file is 0.4 GB smaller.  I've noticed previously though, sometimes the sorted file is larger

 

Can anyone explain what is happening?  It doesn't make sense to me that sorting would affect the file size, let alone be able to make it both larger or smaller. 

 

I'd like to be able to control this...

 

Sorting Size.PNG

Bolide

Hi Rich_Stebles,

 

Looks like the output time has some capacity to reflect the order though indexing techniques.

 

When i run the same scenario into flat file outputs (csv) both input and 2 outputs feature the exact same size.

 

img.JPG

 

I believe the Alteryx database format is heavily indexed file based data store format and thus optimising for sort may affect the output size.

 

I presume only the Alteryx engineers will be able to precisely answer this, but this is my theory - most likely your setup not only affects the size, but also the retrieval time/operations on this data - i would expect these to be more optimised provided set of keys that align with how you analyse your data (filtering, summarize, etc.).

 

Thanks,

Rafal

 

#Excuse me, do you speak Alteryx?

Hi Rafal

 

Thanks for your reply.  I tested that out with my data just to see, and the CSVs came out the same size both ways as you highlight

 

So, definitely something to do with the YXDBs specifically.  It'd be really helpful to know how this works, some of my files can vary by 3 to 5 GB due to this.

 

Hopefully someone from the Alteryx team will see this and let us know.

 

Regards

Richard

Bolide

Yes, this is very interesting to know that your outputs may very by so much!

 

CSV is a very basic format and there is no place for any metadata, even field types are not stored - i always like to use Notepad++ to see the inside, as Excel may interpret things for you (without asking) like long number value - but basically this is WYSIWYG (what you see is what you get).

 

If feels like Alteryx database format has some invisible magic behind it and is heavily affected by sorting specifically, which in my thinking is linked with indexing - just wondering what else might be affecting this format?

 

Alteryx Engineering team - call for help here!?

Just to update, for any passing readers.  I reached out to Alteryx support directly on this, and the Customer Support Engineer contacted development. 

 

I would summarise the information we got back as broadly, "yes, this is the case, but we can't tell you why."  Different patterns in the data mean different compression; the more data that matches a pattern --> better compression.  But development couldn't provide much information on what these patterns actually are, despite specifically designing the software to identify them.  So we couldn't get to a best practice guide to sorting a dataset for maximum compression.  

 

The exact information we got back was:

 

"The compression algorithm will scan the data for patterns in which is will compress on, however certain data types will perform differently. For example, string fields generally have patterns and will compress 'better', whereas binary fields will have less pattern and therefore are not expected to compress as well. Additionally, some languages have more patterns that can impact compression (i.e. English will compress better than Arabic depending on the compression method)."

 

"The patterns in the data will change with each dataset, so it is not possible for even a workflow that compressed, say 50%, to compress that same amount (or even close to the same amount) even with the same workflow configuration on the next run."

Labels