Weekly Challenge

Solve the challenge, share your solution, and summit the ranks of our Community!
IDEAS WANTED

We're actively looking for ideas on how to improve Weekly Challenges and would love to hear what you think!

Submit Feedback
We've recently made an accessibility improvement to the community and therefore posts without any content are no longer allowed. Please use the spoiler feature or add a short message in the message body in order to submit your weekly challenge.

Challenge #131: Think Like a CSE... The R Error Message: cannot allocate vector of size...

Highlighted
11 - Bolide
Spoiler
As others have said, the error shows that this is a memory issue, so the solution is to score the data in batches instead of trying to do it all in one go.

The Tile tool is really good for this, and I feel it's an underused part of my toolset.


Challenge 131b.PNG  Challenge 131a.PNG
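Outside of Alteryx, the Tile-then-batch idea could be sketched roughly like this in Python (the data, tile size, and "scoring" step here are all illustrative stand-ins, not from the challenge):

```python
import numpy as np

# Hypothetical sketch: mimic the Tile tool by giving each row a tile number,
# then scoring one tile at a time instead of the whole data set at once.
rows = np.arange(100)            # stand-in for the challenge records
tile_size = 25
tile_ids = rows // tile_size     # Tile-tool-style batch numbers: 0, 1, 2, 3

scored = []
for tile in np.unique(tile_ids):
    batch = rows[tile_ids == tile]
    scored.append(batch * 2)     # placeholder for the real scoring step

result = np.concatenate(scored)  # same rows, scored in four small passes
```

Each pass only ever holds one tile in memory, which is the whole point of batching.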
Highlighted
Alteryx

Some thoughts below. I'll attach the note as a text file so I have something attached to my response.

 

Spoiler
I've been doing some reading on the clustering tools (I haven't had to use them yet). As I understand it, the clustering tool defines the clustering model, and the Append Cluster tool assigns records to the clusters. If that's the case, and the tool is running out of memory because we're throwing too much at it, couldn't we just put the Append Cluster tool into a batch macro and let it process the data in bite-size chunks that fit in memory? I just looked at the spoiler while writing this, and it sounds like I'm at least close to what the CSE came up with.
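For intuition, here's a minimal Python sketch of that model-vs-assignment split, with nearest-centroid labeling standing in for what the Append Cluster step conceptually does (the data, centroid count, and chunk count are made up):

```python
import numpy as np

rng = np.random.default_rng(0)
data = rng.normal(size=(10_000, 3))                  # stand-in data set
# Pretend the clustering tool already produced these centroids (the model).
centroids = data[rng.choice(len(data), 5, replace=False)]

def assign(batch, centroids):
    # Nearest-centroid labeling: each row gets the index of its closest centroid.
    dists = np.linalg.norm(batch[:, None, :] - centroids[None, :, :], axis=2)
    return dists.argmin(axis=1)

# Assign in bite-size chunks; concatenating gives the same labels as one big call.
labels = np.concatenate([assign(chunk, centroids)
                         for chunk in np.array_split(data, 20)])
```

Because each row's assignment is independent of every other row, chunking changes peak memory use but not the labels.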
8 - Asteroid

Done. 

 

Spoiler
As others have discussed, this is all about the size of the data being used as input to the Append Cluster tool. The best way around it is to connect the Append Cluster tool to the output of the Select tool. Since the cluster creation is working OK, the clusters can be made; there may just be too many large fields in the raw data.
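A rough Python analogue of "feed the clustering only the fields it needs, then bring the label back to the full data" (the column split, centroids, and distance measure are invented for illustration):

```python
import numpy as np

rng = np.random.default_rng(1)
# Ten columns of raw data, but suppose only the first two feed the clustering.
raw = rng.normal(size=(1_000, 10))
features = raw[:, :2]                    # the Select-tool step: keep the minimum

centroids = np.array([[-1.0, -1.0], [1.0, 1.0]])   # pretend trained model
# Assign each row to its nearest centroid using only the two feature columns.
labels = np.abs(features[:, None, :] - centroids).sum(axis=2).argmin(axis=1)

# "Join" the cluster id back onto the full-width data afterwards.
labeled = np.column_stack([raw, labels])
```

The memory-hungry step only ever sees two columns; the other eight rejoin at the end.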
Alteryx Certified Partner

Here's my solution:

 

Spoiler

cannot allocate vector of size 7531.1 Gb

Looks like a memory issue in the Append Cluster tool.
I built a workflow to see whether batching the Append Cluster tool changed its results
Spoiler
it doesn't

challenge 131.png

I would advise using a batch macro (like the one above) to feed only a subset of the data into the Append Cluster tool at a time, hopefully resolving the error.

8 - Asteroid

Suggestions below.

Spoiler
From looking online, this is a RAM limitation issue: R works with data in memory, and 7.5k GB is far more than you could fit in your machine, or even in a commercial cloud rental. Processing smaller chunks of the data iteratively should be possible using macros.
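For a sense of scale, here's the arithmetic behind the reported size, assuming R's binary gigabytes (2**30 bytes) and 8-byte numeric elements; that unit reading is my interpretation, not stated in the thread:

```python
# "cannot allocate vector of size 7531.1 Gb"
size_gb = 7531.1
bytes_needed = size_gb * 2**30        # R reports sizes in 2**30-byte units
elements = bytes_needed / 8           # an R numeric is an 8-byte double
print(f"about {elements:.2e} numbers in one vector")  # about 1.01e+12
```

Roughly a trillion doubles in a single allocation, which is why no amount of extra RAM is a realistic fix here.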

13 - Pulsar
Spoiler
Similar to other solutions: the issue is a lack of memory to complete this workflow as-is. Adding memory would be problematic given the size required, and an inefficient use of resources. The better solution is to chunk the data into manageable pieces, which lends itself to a batch macro. I tested the batch macro with a smaller hockey data set I had kicking around to verify that it returned the same results.
Spoiler
Macro 131.png

 

Workflow 131.png

 

8 - Asteroid

The memory issue is on the Append Cluster tool, which means the clustering is working but the labeling isn't, so we can batch the latter.

 

Nice to know!

8 - Asteroid
Spoiler

As an Alteryx CSE, I would investigate the error message and the tools used in the workflow and quickly find the following:

 

"Error messages beginning 'cannot allocate vector of size' indicate a failure to obtain memory, either because the size exceeded the address-space limit for a process or, more likely, because the system was unable to provide the memory."

 

Then I'd ask the community, and they'd say to use a batch macro, since it's only the Append Cluster tool that's throwing the error, and I'd send them this example of a batch macro workflow using the Append Cluster tool. It uses the Tile tool to create X tiles/batches, which are then run through one at a time.

 

I would also mention to the user that he/she should consider using the K-Centroids Diagnostics tool, which will output a 'K-Means Cluster Assessment Report'. Per the tool mastery article, "The K-Centroids Diagnostics Tool provides information to assist in determining how many clusters to specify," which would likely be of benefit to the user as well.

 

Sources:

https://stat.ethz.ch/R-manual/R-devel/library/base/html/Memory-limits.html

 

https://community.alteryx.com/t5/Alteryx-Designer-Knowledge-Base/Tool-Mastery-K-Centroids-Cluster-An...


2020-01-31 08_10_25-Greenshot.png

Alteryx Partner
Spoiler
Since this seems to be a memory error, I'd aim to reduce the amount of memory needed (assuming the Alteryx settings already use the maximum RAM available).
Some options:
1. Convert the CSV to a YXDB before running the workflow.
2. Drop all unneeded fields early on. Use only the bare minimum needed for cluster allocation, then bring the cluster ID back into the main data stream via a Join or Find & Replace if needed.
3. Make sure all fields have properly sized types.
4. The cluster generation is working, so split the data for cluster assignment into batches (a batch macro with the fixed object coming from the cluster analysis tool).
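Point 3 (proper field sizes) has a direct analogue in memory terms; a quick Python illustration with an arbitrary array (the shapes and dtypes are just for demonstration):

```python
import numpy as np

# Halving field width halves memory: the same values as float64 vs float32.
data64 = np.zeros((1_000_000, 4), dtype=np.float64)
data32 = data64.astype(np.float32)

print(data64.nbytes // 2**20, "MB vs", data32.nbytes // 2**20, "MB")  # 30 MB vs 15 MB
```

The same principle applies to oversized string fields: trimming field sizes before the memory-hungry tool shrinks every downstream allocation.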