Performance of Summarize (group by) vs Unique

alexcarreraTL
5 - Atom

Hi,

 

Which tool is more performant, in both speed and memory, on a data set with a single column: Summarize (group by) or Unique? In my client's use case, there are many different records to be grouped into four buckets.

 

The documentation says the Unique tool "groups by one or more fields". Am I right in assuming the tool calls a hash function even for tables with one column, or does it perform a basic sort and compare under the hood?
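For readers unfamiliar with the two strategies named in the question, here is a minimal Python sketch of each on a single column of values. It is only an illustration of the general techniques; it makes no claim about how the Unique or Summarize tool is implemented inside the Alteryx engine, and the function names are hypothetical.

```python
def unique_by_hash(values):
    """Hash-based deduplication: one pass, keeps first-seen order.

    Uses extra memory proportional to the number of distinct values.
    """
    seen = set()
    result = []
    for value in values:
        if value not in seen:  # set membership is a hash lookup
            seen.add(value)
            result.append(value)
    return result


def unique_by_sort(values):
    """Sort-and-compare deduplication: sort, then drop adjacent repeats.

    Output is in sorted order rather than first-seen order.
    """
    result = []
    for value in sorted(values):
        # After sorting, duplicates are adjacent, so comparing with the
        # last kept value is enough.
        if not result or value != result[-1]:
            result.append(value)
    return result


if __name__ == "__main__":
    column = ["b", "a", "b", "c", "a"]
    print(unique_by_hash(column))
    print(unique_by_sort(column))
```

With only four distinct buckets, the hash set in the first approach stays tiny regardless of row count, whereas the second approach has to sort every row first.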

 

Thanks,

Alex

 

 

Luke_C
17 - Castor

Hi @alexcarreraTL 

 

You could try both and turn on performance profiling to see which is more efficient:

https://community.alteryx.com/t5/Alteryx-Designer-Discussions/Performance-Profiling-in-v10/td-p/3771

 

Emmanuel_G
13 - Pulsar

Hi @alexcarreraTL 

 

Like @Luke_C, I suggest testing these two tools with performance profiling enabled, as shown below, to see which one is more efficient.

Emmanuel_G_0-1667728491961.png

 

Otherwise, at first glance, I would say that Summarize (group by) will be more efficient because it keeps only the grouped categories. The Unique tool finds the same categories, but it also outputs all the other fields of the dataset for each of them. So the number of rows will be the same for both, but because the Unique tool carries many more columns, I think Summarize wins on time. I did a test with a dummy dataset, which you can find in the attachment.
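The difference in output shape described above can be sketched in plain Python. This is an illustration only, not Alteryx code: rows are modeled as dictionaries, and the function and field names are hypothetical.

```python
def unique_rows(rows, key_fields):
    """Keep the first full row for each distinct key (all columns survive)."""
    seen = set()
    output = []
    for row in rows:
        key = tuple(row[field] for field in key_fields)
        if key not in seen:
            seen.add(key)
            output.append(row)  # the whole row is passed through
    return output


def group_by_only(rows, key_fields):
    """Keep one row per distinct key, containing only the key columns."""
    groups = {}
    for row in rows:
        key = tuple(row[field] for field in key_fields)
        # Only the grouped fields are stored; other columns are dropped.
        groups.setdefault(key, {field: row[field] for field in key_fields})
    return list(groups.values())


if __name__ == "__main__":
    data = [
        {"bucket": "A", "id": 1, "note": "x"},
        {"bucket": "B", "id": 2, "note": "y"},
        {"bucket": "A", "id": 3, "note": "z"},
    ]
    print(unique_rows(data, ["bucket"]))
    print(group_by_only(data, ["bucket"]))
```

Both functions return the same number of rows; only the width of each row differs, which is the basis of the guess that the narrower output is cheaper to produce.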

 

Emmanuel_G_1-1667728845114.png

 

Let us know if this answers your question.

 

danilang
19 - Altair

Hi @Emmanuel_G 

 

Good job providing an actual sample; however, my test came up with the opposite result. The test workflow (Test.yxmd) is this:

 

danilang_0-1667733625637.png

10M rows are generated and 7 fields are filled with random strings. The Unique and Summarize tools only deal with the first two test fields. The tools in the Force Completion container are there to do something with the outputs of the Unique tool. Without these extra tools, the engine would halt the processing of the Unique tool after a time and give the "Processing was halted by a downstream tool" message. This doesn't occur in Desktop, but it does on the server or when run through the AlteryxRunner.exe command line.

 

I ran this workflow 10 times (TestRunner.yxmd) through the CReW List Runner macro (hence the AlteryxRunner.exe dependency) so I could have 10 runs' worth of results to analyze (LogParser.yxmd).

 

Here are the summarized results of the ten runs

danilang_1-1667734550891.png

Both tools process the 10M rows in about 8 seconds, with Unique running about 2% faster than Summarize but showing a wider range of run times.
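For anyone repeating this kind of comparison, the per-tool statistics (mean, range, relative difference) can be computed from the logged run times with a few lines of Python. This is a rough sketch, not the LogParser.yxmd workflow; the function names are hypothetical and the numbers in the example are made-up placeholders, not the measured results.

```python
import statistics


def summarize_runs(times_in_seconds):
    """Return mean, min, max and range for a list of run times."""
    fastest = min(times_in_seconds)
    slowest = max(times_in_seconds)
    return {
        "mean": statistics.mean(times_in_seconds),
        "min": fastest,
        "max": slowest,
        "range": slowest - fastest,  # spread between best and worst run
    }


def percent_faster(fast_mean, slow_mean):
    """How much faster the first mean is, as a percentage of the second."""
    return (slow_mean - fast_mean) / slow_mean * 100.0


if __name__ == "__main__":
    # Placeholder timings for illustration only.
    tool_a = [7.5, 8.0, 8.5]
    tool_b = [8.0, 8.1, 8.2]
    stats_a = summarize_runs(tool_a)
    stats_b = summarize_runs(tool_b)
    print(stats_a)
    print(stats_b)
    print(percent_faster(stats_a["mean"], stats_b["mean"]))
```

Reporting the range alongside the mean matters here: a difference of a few percent between means is hard to interpret when the run-to-run spread of one tool is of similar size.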

 

@alexcarreraTL: The difference in execution speed is negligible, even on large datasets, so use the tool that best suits your needs.

 

Dan 

   
