Here's my solution.
My first reaction was to increase memory (i.e., add more RAM), but on closer inspection, scaling up to 7,500 GB+ of memory would be prohibitively expensive.
The second thing that came to mind was to load the data in batches, a common practice for working with large datasets.
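Outside of Designer, the same idea in base R might look roughly like this; the file name "big_data.csv" and the chunk size are placeholders for illustration, not the actual challenge data:

```r
# Sketch only: "big_data.csv" and chunk_size are placeholders.
chunk_size <- 100000
con <- file("big_data.csv", open = "r")
col_names <- strsplit(readLines(con, n = 1), ",")[[1]]  # consume the header row

repeat {
  chunk <- tryCatch(
    read.csv(con, nrows = chunk_size, header = FALSE, col.names = col_names),
    error = function(e) NULL  # read.csv errors once the connection is exhausted
  )
  if (is.null(chunk) || nrow(chunk) == 0) break
  # aggregate, filter, or score this chunk here instead of holding everything in RAM
}
close(con)
```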
A third strategy is to ask whether we actually need this much data. Can we perform some preprocessing and feature engineering to drop uninformative data points and reduce the size of the data? Common methods include dimensionality reduction.
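For example, a quick dimensionality-reduction pass with PCA could look like the sketch below; `df` is a placeholder and is assumed to be all-numeric and already cleaned:

```r
# Sketch only: `df` is a placeholder for the numeric, already-cleaned input data.
pca <- prcomp(df, center = TRUE, scale. = TRUE)

# Keep just enough principal components to explain ~95% of the variance.
var_explained <- cumsum(pca$sdev^2) / sum(pca$sdev^2)
k <- which(var_explained >= 0.95)[1]
reduced <- as.data.frame(pca$x[, 1:k, drop = FALSE])
```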
When size is an issue, splitting into batches seems like a pragmatic solution.
My thoughts:
1. We probably need a way to batch the records so the Append Cluster tool doesn't error out with a vector size error (see the sketch further down).
Other than that, I need to revisit this. I did create a dummy dataset using the ideas in the solution, but I'm not sure I'm doing it right.
I did batch it and run it, and I got an output with no vector size errors.
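A rough sketch of the batching idea itself, in R (which the predictive tools run under the hood); `full_data` and the batch size are stand-ins, not the actual challenge data:

```r
# Sketch only: `full_data` and the batch size are placeholders.
batch_size <- 50000
batch_id   <- ceiling(seq_len(nrow(full_data)) / batch_size)

scored_batches <- lapply(split(full_data, batch_id), function(batch) {
  # run the scoring / Append Cluster step against this slice only
  batch
})
scored <- do.call(rbind, scored_batches)
```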
I was unable to replicate the error. I created a dummy dataset and tried to hit a memory limit, but even with Designer's memory limit set to 400 MB, I couldn't trigger one. I think the page file was being used instead of the underlying R code hitting a hard limit.
I have a feeling there was a mismatch between what was fed to the Select tool, and therefore to the K-Centroids Cluster tool, which caused an error once the original data was scored. Another possibility is that only a fraction of the initial dataset was used to train the clustering (hence the Select tool), and the resulting dataset was simply too big to score in one pass. If it's the latter, I'd suggest saving the cluster model object in one workflow and then scoring the data in a separate batch macro workflow, which would require less data to be loaded at a time.
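A rough sketch of what that two-workflow split could look like in the underlying R; the object names, sample size, and number of centers here are illustrative assumptions, not the challenge's actual configuration:

```r
# Workflow 1 (sketch): train the clusters on a manageable sample only.
# `full_data` is a placeholder and is assumed to contain only numeric fields.
train_sample <- full_data[sample(nrow(full_data), 100000), ]
model <- kmeans(train_sample, centers = 5, nstart = 10)
saveRDS(model, "kmeans_model.rds")  # persist the cluster model object

# Workflow 2 (sketch, e.g. inside a batch macro): load the model and score
# one batch at a time by assigning each row to its nearest centroid.
model <- readRDS("kmeans_model.rds")
nearest_centroid <- function(batch, centers) {
  x <- as.matrix(batch)
  # squared Euclidean distance from every row of the batch to every centroid
  d <- outer(rowSums(x^2), rowSums(centers^2), "+") - 2 * x %*% t(centers)
  max.col(-d)  # column index of the smallest distance = assigned cluster
}

batch <- full_data[1:50000, ]  # stand-in for one macro iteration's slice of records
batch$cluster <- nearest_centroid(batch, model$centers)
```

Because only the model object and one slice of records are in memory at any time, each iteration stays well under the limit that the full dataset would hit.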