Performance of Summarize (group by) vs Unique
Hi,
Which tool is more performant, in both speed and memory, on a data set with a single column? In my client's use case, there are many different records to be grouped into four buckets.
The documentation says the Unique tool "groups by one or more fields". Am I right in assuming the tool calls a hash function even for tables with one column, or does it perform a basic sort and compare under the hood?
Thanks,
Alex
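To make the two strategies in the question concrete, here is a minimal Python sketch of hash-based versus sort-and-compare deduplication on a single column. It only illustrates the general techniques; it is not a description of how the Unique or Summarize tool is implemented, and the function names and sample data are made up for the example.

```python
from itertools import groupby


def unique_by_hash(values):
    """Keep the first occurrence of each value, tracking seen values in a hash set."""
    seen = set()
    result = []
    for v in values:
        if v not in seen:
            seen.add(v)
            result.append(v)
    return result


def unique_by_sort(values):
    """Sort the values, then keep one value from each run of equal neighbours."""
    return [key for key, _ in groupby(sorted(values))]


# Hypothetical single-column data that collapses into four buckets.
records = ["b", "a", "d", "b", "c", "a", "d", "d"]

print(unique_by_hash(records))  # distinct values in first-seen order
print(unique_by_sort(records))  # distinct values in sorted order
```

The hash approach makes one pass and needs memory proportional to the number of distinct values (four here), while the sort approach has to order all the rows first. Which one a given tool uses is exactly what profiling can help reveal.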
- Labels:
- Best Practices
- Optimization
You could try both and turn on performance profiling to see which is more effective:
https://community.alteryx.com/t5/Alteryx-Designer-Discussions/Performance-Profiling-in-v10/td-p/3771
Like @Luke_C, I suggest testing both tools with performance profiling enabled to see which one is more efficient.
That said, at first glance I would expect the Summarize tool (Group By) to be more efficient, because it keeps only the grouped fields. The Unique tool returns the same unique categories but also passes through all the other fields of the dataset. So the number of rows will be the same for both, but because the Unique tool outputs many more columns, I think Summarize wins on time. I ran a test with a dummy dataset, which you can find in the attachment.
Let us know if this answers your question.
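The difference in output shape described above can be sketched in plain Python. This is only an illustration of the idea (same number of rows, different number of columns); the field names and rows are hypothetical, and the output ordering of the real tools may differ.

```python
# Hypothetical dataset: one key field plus extra fields.
rows = [
    {"category": "A", "value": 1, "note": "first A"},
    {"category": "B", "value": 2, "note": "first B"},
    {"category": "A", "value": 3, "note": "second A"},
    {"category": "C", "value": 4, "note": "first C"},
    {"category": "B", "value": 5, "note": "second B"},
]


def unique_first(rows, key):
    """Unique-style output: the first full record for each key value."""
    seen = set()
    out = []
    for row in rows:
        if row[key] not in seen:
            seen.add(row[key])
            out.append(row)
    return out


def group_by_key(rows, key):
    """Group-by-style output: one record per key value, key field only."""
    distinct = dict.fromkeys(row[key] for row in rows)
    return [{key: k} for k in distinct]


unique_rows = unique_first(rows, "category")
grouped_rows = group_by_key(rows, "category")

print(len(unique_rows), len(grouped_rows))  # same row count
print(unique_rows[0])   # all fields carried through
print(grouped_rows[0])  # only the key field
```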
Hi @Emmanuel_G
Good job providing an actual sample; however, my test came up with the opposite result. The test workflow is Test.yxmd.
It generates 10M rows and fills 7 fields with random strings. The Unique and Summarize tools only deal with the first two test fields. The tools in the Force Completion container are there to do something with the outputs of the Unique tool. Without these extra tools, the engine would halt the processing of the Unique tool after a time and give the "Processing was halted by a downstream tool" message. This doesn't occur in Desktop, but it does on the server, or if the workflow is run through the AlteryxRunner.exe command line.
I ran this workflow 10 times (TestRunner.yxmd) through the CReW List Runner macro (hence the AlteryxRunner.exe dependency) so I would have 10 runs' worth of results to analyze (LogParser.yxmd).
Summarized over the ten runs: both tools process the 10M rows in about 8 seconds, with the Unique tool running about 2% faster than the Summarize tool but showing a wider range of run times.
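The same repeat-and-summarize approach can be sketched outside Alteryx. The Python below times two stand-in deduplication functions over several runs and reports the mean and range of run times. Everything in it (function names, data size, run count) is an assumption for illustration, and its timings say nothing about the Alteryx tools themselves.

```python
import random
import statistics
import string
import time


def dedup_hash(values):
    """Hash-based distinct values (dict keys preserve first-seen order)."""
    return list(dict.fromkeys(values))


def dedup_sort(values):
    """Sort-and-compare distinct values."""
    ordered = sorted(values)
    return [v for i, v in enumerate(ordered) if i == 0 or v != ordered[i - 1]]


def time_runs(func, data, runs=10):
    """Run func(data) several times and return each wall-clock duration in seconds."""
    durations = []
    for _ in range(runs):
        start = time.perf_counter()
        func(data)
        durations.append(time.perf_counter() - start)
    return durations


def summarize(durations):
    """Mean, min, max and range of a list of durations."""
    low, high = min(durations), max(durations)
    return {
        "mean": statistics.mean(durations),
        "min": low,
        "max": high,
        "range": high - low,
    }


# Hypothetical test data: random 3-letter strings (far smaller than 10M rows).
random.seed(0)
data = ["".join(random.choices(string.ascii_lowercase, k=3)) for _ in range(50_000)]

for name, func in (("hash", dedup_hash), ("sort", dedup_sort)):
    stats = summarize(time_runs(func, data))
    print(f"{name}: mean={stats['mean']:.4f}s range={stats['range']:.4f}s")
```

Looking at the range as well as the mean matters here: a tool that is slightly faster on average but more variable may not be meaningfully faster in practice.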
@alexcarreraTL: The difference in execution speed is negligible in large datasets, so use the tool that best suits your needs.
Dan
