
Challenge #131: Think Like a CSE... The R Error Message: cannot allocate vector of size...

cgoodman3
14 - Magnetar
Spoiler
As others have said, the error points to a memory issue, so the solution is to score the data in batches rather than all in one go.

The Tile tool is really good for this, and I feel it is underused in my own toolset.
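
For readers who want to see the chunking idea outside of Alteryx, here is a minimal Python sketch. It only loosely mimics an equal-records tiling step and is not the Tile tool's actual algorithm; `assign_tiles`, `score_in_tiles`, and the `score_batch` callback are hypothetical names made up for illustration.

```python
from typing import Callable, List, Sequence, TypeVar

T = TypeVar("T")
R = TypeVar("R")


def assign_tiles(n_records: int, n_tiles: int) -> List[int]:
    """Give each record position a tile number in 1..n_tiles, sized as evenly as possible."""
    if n_tiles < 1:
        raise ValueError("n_tiles must be at least 1")
    # Integer arithmetic keeps tile numbers non-decreasing and spreads the remainder.
    return [(i * n_tiles) // n_records + 1 for i in range(n_records)]


def score_in_tiles(
    records: Sequence[T],
    n_tiles: int,
    score_batch: Callable[[List[T]], List[R]],
) -> List[R]:
    """Score one tile at a time so only a single batch is handed to the scorer at once."""
    tiles = assign_tiles(len(records), n_tiles)
    results: List[R] = []
    for tile in range(1, n_tiles + 1):
        batch = [rec for rec, t in zip(records, tiles) if t == tile]
        if batch:  # skip empty tiles when there are more tiles than records
            results.extend(score_batch(batch))
    return results


# Hypothetical scorer standing in for the memory-hungry step.
doubled = score_in_tiles(list(range(10)), 3, lambda batch: [x * 2 for x in batch])
```

Because the tile numbers never decrease, the combined results come back in the original record order.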


Challenge 131b.PNG
Challenge 131a.PNG
Chris
Check out my collaboration with fellow ACE Joshua Burkhow at AlterTricks.com
TonyA
Alteryx Alumni (Retired)

Some thoughts below. I'll append the note in a text file so I have something attached to my response. 

 

Spoiler
I've been doing some reading on the clustering tools - haven't had to use them yet. As I understand it, the clustering tool defines the clustering model and the append cluster tool assigns elements to the clusters. If that's the case and the tool is running out of memory because we're throwing too much at it, couldn't we just put the append tool into a batch macro and let it process the data in bite-size chunks that fit in memory? Just looked at the spoiler while writing this and sounds like I'm at least close to what the CSE came up with.
timrains
8 - Asteroid

Done. 

 

Spoiler
As others have discussed, this is all about the size of the data fed into the Append Cluster tool. The best way around it is to connect the Append Cluster tool to the output of the Select tool. Since the create-clusters step works, the clusters can be built; it's just that the raw data may contain too many large fields.
OllieClarke
15 - Aurora

Here's my solution:

 

Spoiler

cannot allocate vector of size 7531.1 Gb

Looks like a memory issue in the append cluster tool. 
I made a workflow to see if batching the process of the append cluster tool affected it
Spoiler
it doesn't

challenge 131.png

I would advise using a batch macro (like above) to only feed a subset of data into the append cluster tool at a time - hopefully solving the error.

 

 

KMiller
8 - Asteroid

Suggestions below.

Spoiler
From looking online, this is a RAM limitation issue: R works with data in memory. Roughly 7,500 Gb is far more than you will be able to put in your machine, or even rent from commercial cloud providers. Processing smaller chunks of the data iteratively may be possible using macros.
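
To put the number in the error message in perspective, here is a rough back-of-the-envelope calculation in Python. It assumes that R's "Gb" in this message means 1024**3 bytes and that the vector holds 8-byte doubles. The square-matrix reading at the end is purely illustrative; the thread does not say what the tool was actually trying to allocate.

```python
import math

GIB = 1024 ** 3        # assumption: "Gb" in R's message is 1024**3 bytes
BYTES_PER_DOUBLE = 8   # an R numeric value is stored as an 8-byte double

requested_bytes = 7531.1 * GIB
n_doubles = requested_bytes / BYTES_PER_DOUBLE

# Illustrative only: the side length of a square matrix of doubles of that size.
side = math.isqrt(int(n_doubles))

print(f"requested: {requested_bytes / 1e12:.2f} TB")
print(f"doubles:   {n_doubles:.3e}")
print(f"square matrix side: about {side:,} rows")
```

Under those assumptions the request is on the order of a trillion numbers in a single vector, which is why adding RAM is not a realistic fix.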

 

 

 

T_Willins
14 - Magnetar
Spoiler
Similar to other solutions - The issue is the lack of memory to complete this workflow as is.  Adding memory would be problematic due to size requirements and inefficient use of resources.  Better solution is to chunk the data into manageable pieces, which lends itself to a batch macro solution.  Tested batch macro with smaller hockey data set I had kicking around to verify batch macro returned the same results.
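
The "same results" check can be sketched in plain Python. This assumes the append step boils down to a nearest-centroid assignment against a fixed, already-fitted model (a simplification, not the actual R code behind the Append Cluster tool); the function names and toy data are hypothetical.

```python
from typing import List, Sequence

Point = Sequence[float]


def nearest_centroid(point: Point, centroids: Sequence[Point]) -> int:
    """Return the index of the centroid closest to `point` (squared Euclidean distance)."""
    best_index, best_dist = 0, float("inf")
    for idx, centroid in enumerate(centroids):
        dist = sum((p - c) ** 2 for p, c in zip(point, centroid))
        if dist < best_dist:
            best_index, best_dist = idx, dist
    return best_index


def assign_all(points: Sequence[Point], centroids: Sequence[Point]) -> List[int]:
    """Label every point in one pass (the approach that ran out of memory at scale)."""
    return [nearest_centroid(p, centroids) for p in points]


def assign_in_batches(
    points: Sequence[Point], centroids: Sequence[Point], batch_size: int
) -> List[int]:
    """Label points one slice at a time, reusing the same fixed centroids."""
    labels: List[int] = []
    for start in range(0, len(points), batch_size):
        labels.extend(assign_all(points[start:start + batch_size], centroids))
    return labels


# Toy data: two centroids, six points.
centroids = [(0.0, 0.0), (10.0, 10.0)]
points = [(1.0, 1.0), (9.0, 8.0), (0.0, 2.0), (11.0, 10.0), (4.0, 4.0), (6.0, 7.0)]

full = assign_all(points, centroids)
batched = assign_in_batches(points, centroids, batch_size=2)
print(full == batched)
```

Each label depends only on the record itself and the fixed centroids, so splitting the records into batches cannot change any assignment.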
Spoiler
Macro 131.png

 

Workflow 131.png

 

rmassambane
10 - Fireball

Memory issue on the Append Cluster tool, which means the clustering is working but the labeling isn't. So we can batch the latter.

 

Nice to know!

mbogusz
9 - Comet
Spoiler

As an Alteryx CSE I would investigate the error message and the tools used in the workflow and quickly find out:

 

"Error messages beginning 'cannot allocate vector of size' indicate a failure to obtain memory, either because the size exceeded the address-space limit for a process or, more likely, because the system was unable to provide the memory."

 

Then I'd ask the community, and they'd say to use a batch macro, since it's only the Append Cluster tool that's throwing the error. I'd send them this example of a batch macro workflow using the Append Cluster tool. It uses the Tile tool to create X tiles (batches), which are then run through the macro one at a time.

 

I would also mention to the user that he/she should consider using the K-Centroids Diagnostics tool, which will output a 'K-Means Cluster Assessment Report'. Per the tool mastery article, "The K-Centroids Diagnostics Tool provides information to assist in determining how many clusters to specify," which would likely be of benefit to the user as well.

 

Sources:

https://stat.ethz.ch/R-manual/R-devel/library/base/html/Memory-limits.html

 

https://community.alteryx.com/t5/Alteryx-Designer-Knowledge-Base/Tool-Mastery-K-Centroids-Cluster-An...


2020-01-31 08_10_25-Greenshot.png

dsmdavid
11 - Bolide
Spoiler
Since this seems to be a memory error, I'd aim to reduce the amount of memory needed (provided the Alteryx settings are already using the maximum RAM available).
Some options:
1. Convert the CSV to YXDB before running the workflow.
2. Drop all unneeded fields early on. Use only the bare minimum needed for the cluster allocation, then bring the cluster ID back to the main stream of data via Join or Find Replace if needed.
3. Make sure all fields have proper sizes.
4. The cluster generation is working, so split the data for the cluster assignment into batches (a batch macro with the fixed model object coming from the K-Centroids Cluster Analysis tool).
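
Option 2 above can be sketched in Python as "slim, assign, join back". This is only an illustration with dictionaries rather than an Alteryx workflow; the field names (`id`, `goals`, `notes`), the helper functions, and the threshold rule standing in for the cluster assignment are all hypothetical.

```python
from typing import Any, Callable, Dict, Hashable, List, Sequence

Record = Dict[str, Any]


def slim_records(records: Sequence[Record], key: str, needed: Sequence[str]) -> List[Record]:
    """Keep only the key and the fields the cluster assignment actually needs."""
    return [{name: rec[name] for name in (key, *needed)} for rec in records]


def assign_clusters(
    slim: Sequence[Record], key: str, assign_one: Callable[[Record], int]
) -> Dict[Hashable, int]:
    """Map each record key to its cluster ID using the slimmed-down records."""
    return {rec[key]: assign_one(rec) for rec in slim}


def join_back(
    records: Sequence[Record], cluster_by_key: Dict[Hashable, int], key: str
) -> List[Record]:
    """Attach the cluster ID to the original wide records via the key."""
    return [{**rec, "cluster": cluster_by_key[rec[key]]} for rec in records]


# Toy data with one wide field that the assignment never needs.
records = [
    {"id": 1, "goals": 2.0, "notes": "x" * 1000},
    {"id": 2, "goals": 30.0, "notes": "y" * 1000},
]

slim = slim_records(records, key="id", needed=["goals"])
# Hypothetical stand-in for the real cluster assignment.
clusters = assign_clusters(slim, key="id", assign_one=lambda r: 0 if r["goals"] < 10 else 1)
joined = join_back(records, clusters, key="id")
```

The wide fields never reach the assignment step, yet they are still present in the joined output.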