Challenge #131: Think Like a CSE... The R Error Message: cannot allocate vector of size...
Here's my solution.
My first reaction was to increase memory (i.e., add more RAM), but on closer inspection, increasing the memory to 7500 GB+ would be very expensive.
The second thing that came to mind is to load the data in batches, which is a common practice for large datasets.
A third strategy is to ask whether we actually need this much data. Can we perform some preprocessing and feature engineering to drop uninformative data points and reduce the size of the data? Common methods include dimensionality reduction.
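To make the batch-loading idea concrete, here is a minimal R sketch using readr::read_csv_chunked; the file name, column names, chunk size, and the per-chunk summary are all illustrative placeholders for whatever the workflow actually needs to do.

```r
library(readr)
library(dplyr)

# Hypothetical file and column names; swap in the real dataset.
# Each 100,000-row chunk is reduced to a small summary, so the
# full table never has to fit in memory at once.
chunk_summaries <- read_csv_chunked(
  "sales.csv",
  DataFrameCallback$new(function(chunk, pos) {
    chunk %>% summarise(rows = n(), total_amount = sum(amount, na.rm = TRUE))
  }),
  chunk_size = 100000
)

# Roll the per-chunk summaries up into overall figures.
chunk_summaries %>% summarise(rows = sum(rows), total_amount = sum(total_amount))
```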
When size is an issue, splitting into batches seems like a pragmatic solution.
I thought that loading the data in batches could solve this issue.
Reading the error message in the post, it says the vector being sent into the cluster tool is too large.
Googling the error shows that allocating more memory is one solution. Alternatively, you could break up the data: take a more recent sample if recency is what matters, or take a random sample and extrapolate from it. Either approach would work around the error.
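As a rough sketch of the sampling idea in R (the data frame, column names, and date cutoff are invented for illustration):

```r
set.seed(42)

# 'full_data' and its columns are illustrative stand-ins for the real table.
# Option 1: keep only recent records, if recency is what matters.
recent_data <- full_data[full_data$date >= as.Date("2019-01-01"), ]

# Option 2: cluster on a 10% random sample instead of the whole table.
idx <- sample(nrow(full_data), size = ceiling(0.10 * nrow(full_data)))
sampled_data <- full_data[idx, ]

# Fit k-means on the sample; the centroids can later be used to label
# the remaining rows in manageable pieces.
km <- kmeans(sampled_data[, c("x1", "x2", "x3")], centers = 5, nstart = 10)
```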
1. I think we need a way to batch the records so the Append Cluster tool does not error out with a vector size error.
Other than that, I need to revisit this. I created a dummy data set using the ideas in the solution, but I'm not sure I'm doing it right.
I did batch it and run it, and got an output with no vector size errors.
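A hedged sketch of what batching the scoring step might look like in plain R, assuming a fitted kmeans object called km and a large table called big_data (both names, and the column list, are invented here):

```r
# Rows are labelled 500,000 at a time, so no single distance matrix
# gets large enough to trigger the allocation error.
assign_clusters <- function(x, centers) {
  x <- as.matrix(x)  # columns must match the order used when fitting
  # Squared Euclidean distance from every row to every centroid.
  d2 <- outer(rowSums(x^2), rowSums(centers^2), "+") - 2 * x %*% t(centers)
  max.col(-d2)       # index of the nearest centroid per row
}

batch_size <- 500000
n <- nrow(big_data)
labels <- integer(n)

for (s in seq(1, n, by = batch_size)) {
  e <- min(s + batch_size - 1, n)
  labels[s:e] <- assign_clusters(big_data[s:e, c("x1", "x2", "x3")], km$centers)
}

big_data$cluster <- labels
```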
I was unable to replicate the error. I created a dummy dataset and tried to hit a memory limit, but even with Designer set to 400 MB I never did; I think the page file was being used rather than the underlying R code hitting its limit.
I have a feeling there was some mismatch between what was fed to the Select tool, and therefore to the K-Centroids Cluster tool, which caused an error once the original data was scored. Another possibility is that only a fraction of the initial dataset was used to train the clustering (hence the Select tool), and the dataset was simply too big when it came time to score it. If it was the latter, I'd suggest saving the cluster model object in one workflow and then scoring the data in a separate batch macro workflow, which would require less data to be loaded at a time.
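A minimal sketch of that train-once / score-in-batches split in R, assuming the model is persisted with saveRDS by the training workflow and reloaded by the R tool inside the batch macro; the path and all object names are illustrative:

```r
# Workflow 1 (training): fit once on a sample and persist only the small model
# object, not the data.
km <- kmeans(train_sample[, c("x1", "x2", "x3")], centers = 5, nstart = 10)
saveRDS(km, "C:/Temp/kmeans_model.rds")

# Workflow 2 (scoring, run once per batch by the batch macro): reload the model
# and label only the incoming batch, so memory use stays bounded per run.
km <- readRDS("C:/Temp/kmeans_model.rds")

assign_clusters <- function(x, centers) {
  x <- as.matrix(x)
  d2 <- outer(rowSums(x^2), rowSums(centers^2), "+") - 2 * x %*% t(centers)
  max.col(-d2)  # nearest centroid for each row
}

batch$cluster <- assign_clusters(batch[, c("x1", "x2", "x3")], km$centers)
```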