
Weekly Challenges


Challenge #131: Think Like a CSE... The R Error Message: cannot allocate vector of size...

acarter881
12 - Quasar

Here's my solution.

Spoiler
This Stack Overflow post explains memory management in R. Also, running a batch macro to feed the data in chunks instead of all at once could bypass the memory limitation. 7,000+ GB (i.e., 7+ TB) is a lot of RAM; I think it's beyond most computers. Mine has around 40 GB.
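For context, a quick back-of-envelope calculation (with made-up row and column counts) shows where multi-terabyte allocation requests can come from; a full pairwise distance matrix is one common culprit in clustering code:

```r
# Every cell of a dense numeric (double) matrix costs 8 bytes.
rows <- 3e6                                   # illustrative row count
cols <- 30                                    # illustrative column count
raw_gb <- rows * cols * 8 / 1024^3
raw_gb                                        # ~0.67 GB for the raw data alone

# A full pairwise distance matrix over the same rows grows quadratically:
dist_gb <- (rows * (rows - 1) / 2) * 8 / 1024^3
dist_gb                                       # tens of thousands of GB, i.e. terabytes
```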
martinding
13 - Pulsar

My first reaction was to increase the memory (i.e., get more RAM), but on closer inspection, increasing it to 7,500+ GB would be very expensive.


The second thing that came to mind was to load the data in batches, which is common practice for large datasets.


A third strategy is to ask: do we actually need this much data? Perhaps we can do some preprocessing and feature engineering to drop uninformative data points and shrink the dataset; dimensionality reduction is a common technique here.
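A minimal R sketch of that third strategy, using base R's prcomp and kmeans on simulated stand-in data (the sizes, variance threshold, and cluster count are all assumptions for illustration):

```r
set.seed(42)
df <- as.data.frame(matrix(rnorm(10000 * 30), ncol = 30))  # stand-in for the real data

# Reduce dimensionality with PCA before clustering.
pca <- prcomp(df, center = TRUE, scale. = TRUE)
explained <- cumsum(pca$sdev^2) / sum(pca$sdev^2)
n_comp <- which(explained >= 0.9)[1]       # keep enough components for ~90% of the variance
reduced <- pca$x[, 1:n_comp, drop = FALSE]

clusters <- kmeans(reduced, centers = 5)   # cluster the smaller representation
table(clusters$cluster)
```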

JamesCharnley
13 - Pulsar
Spoiler
I once took an R course that covered clustering, so I started by opening up the macro and taking a look at the code. That was quickly a dead end, since I couldn't understand a thing, and I didn't do a very good job of recreating the error either. So I did what I should have started with: I googled the error, and some bloke who's much smarter than me explained that it means running out of RAM, suggesting either finding more RAM or batching the simulations in smaller groups of n. Seems to make sense.
grazitti_sapna
17 - Castor

Solution

Sapna Gupta
ahsanaali
11 - Bolide
Spoiler
ahsanaali_0-1671261680677.png

When size is an issue, splitting into batches seems like a pragmatic solution.

alacoume
9 - Comet

I thought that:

Spoiler
this is an error message about exceeding available memory in R, so the data is too big.
Loading the data in batches could solve this issue.
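A minimal base-R sketch of that batching idea, reading a file in fixed-size chunks instead of all at once (the file name and chunk size are hypothetical):

```r
infile     <- "big_input.csv"   # hypothetical file
chunk_rows <- 100000

header <- names(read.csv(infile, nrows = 1))
skip   <- 1                     # rows already consumed (the header line)
repeat {
  chunk <- tryCatch(
    read.csv(infile, skip = skip, nrows = chunk_rows,
             header = FALSE, col.names = header),
    error = function(e) NULL    # read.csv errors once we run past the end of the file
  )
  if (is.null(chunk) || nrow(chunk) == 0) break
  # ... process or score this chunk here, then let it be garbage-collected ...
  skip <- skip + nrow(chunk)
}
```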
chandler-gjino
Alteryx

Attaching my solution.

ed_hayter
12 - Quasar
Spoiler
I tried to recreate the error first by generating 3 million rows, but with only 4 fields I couldn't trigger it; presumably you need all 30 fields, with some data in them, to reproduce it, and I decided that would be too time-intensive.

Reading the error message in the post, it says the vector being sent into the clustering step is too large.

Googling the error shows that allocating more memory is one solution. Alternatively, you could break up the data: take a more recent sample if that is more relevant, or take a random sample and find a way to extrapolate from it to work around the error.
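A hedged sketch of the random-sample route: fit the clusters on a manageable sample, then assign every row to its nearest learned centroid (the data is simulated and the nearest_centroid helper is just for illustration):

```r
set.seed(1)
full <- matrix(rnorm(300000 * 4), ncol = 4)      # stand-in for the large dataset

sample_idx <- sample(nrow(full), 20000)          # train on a random subset only
fit <- kmeans(full[sample_idx, ], centers = 5)

# Extrapolate: label every row by its nearest centroid (base kmeans has no predict()).
nearest_centroid <- function(x, centers) {
  d <- sapply(seq_len(nrow(centers)),
              function(k) rowSums(sweep(x, 2, centers[k, ])^2))
  max.col(-d)                                    # column index of the smallest distance
}
labels <- nearest_centroid(full, fit$centers)
table(labels)
```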
mithily
8 - Asteroid
Spoiler
Some suggestions:

1. I think we need a way to batch the records so the Append Cluster step does not fail with a vector-size error.

Other than that, I need to revisit this. I created a dummy dataset using the ideas in the solution, but I'm not sure I'm doing it right.

I did batch it and run it, and got an output with no vector-size errors.

geoff_zath
Alteryx
Alteryx

I was unable to replicate the error. I created a dummy dataset and tried to hit a memory limit, but even with Designer's memory limit set to 400 MB, I couldn't hit one. I think the page file was being used rather than the underlying R code hitting a limit.


I have a feeling there was a mismatch between what was fed to the Select tool, and therefore to the K-Centroids Cluster tool, which caused an error once the original data was scored. Another possibility is that only a fraction of the initial dataset was used to train the clustering (hence the Select tool), and the full dataset was simply too big when it came time to score it. If it was the latter, I'd suggest saving the cluster model object in one workflow and then scoring the data in a separate batch macro workflow, so that less data needs to be loaded at a time.
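A minimal sketch of that train-once, score-in-batches split, assuming a k-means model; the file names and data sizes are made up, and in Alteryx the second half would live inside the batch macro's R tool:

```r
## Workflow 1: train on the smaller extract and persist the model object.
train <- matrix(rnorm(50000 * 4), ncol = 4)    # stand-in for the training extract
model <- kmeans(train, centers = 5)
saveRDS(model, "cluster_model.rds")            # stores centers etc., not the raw training file

## Workflow 2 (one batch macro iteration): reload the model and score a single batch.
model <- readRDS("cluster_model.rds")
batch <- matrix(rnorm(10000 * 4), ncol = 4)    # one incoming batch of rows
d <- sapply(seq_len(nrow(model$centers)),
            function(k) rowSums(sweep(batch, 2, model$centers[k, ])^2))
batch_cluster <- max.col(-d)                   # nearest-centroid assignment per row
```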


Spoiler
challenge_131_workflow.png