Here's my solution.
My first reaction was to increase memory (i.e., add more RAM), but on closer inspection, scaling up to 7,500 GB+ of memory would be prohibitively expensive.
The second thing that came to mind was to load the data in batches, a common practice for working with large datasets.
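Outside of Designer, the same idea in base R might look roughly like this; the file name "big_data.csv" and the chunk size are placeholders for illustration, not the actual challenge data:

```r
# Sketch only: "big_data.csv" and chunk_size are placeholders.
chunk_size <- 100000
con <- file("big_data.csv", open = "r")
col_names <- strsplit(readLines(con, n = 1), ",")[[1]]  # consume the header row

repeat {
  chunk <- tryCatch(
    read.csv(con, nrows = chunk_size, header = FALSE, col.names = col_names),
    error = function(e) NULL  # read.csv errors once the connection is exhausted
  )
  if (is.null(chunk) || nrow(chunk) == 0) break
  # aggregate, filter, or score this chunk here instead of holding everything in RAM
}
close(con)
```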
A third strategy is to ask whether we actually need this much data. Can we perform some preprocessing and feature engineering to drop uninformative data points and reduce the size of the data? Common methods include dimensionality reduction.
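For example, a quick dimensionality-reduction pass with PCA could look like the sketch below; `df` is a placeholder and is assumed to be all-numeric and already cleaned:

```r
# Sketch only: `df` is a placeholder for the numeric, already-cleaned input data.
pca <- prcomp(df, center = TRUE, scale. = TRUE)

# Keep just enough principal components to explain ~95% of the variance.
var_explained <- cumsum(pca$sdev^2) / sum(pca$sdev^2)
k <- which(var_explained >= 0.95)[1]
reduced <- as.data.frame(pca$x[, 1:k, drop = FALSE])
```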
When size is an issue, splitting into batches seems like a pragmatic solution.
My thoughts:
1. We probably need a way to batch the records so the Append Cluster tool doesn't error out with a vector size error (see the sketch further down).
Other than that, I need to revisit this. I did create a dummy dataset using the ideas in the solution, but I'm not sure I'm doing it right.
I did batch it and run it, and I got an output with no vector size errors.
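A rough sketch of the batching idea itself, in R (which the predictive tools run under the hood); `full_data` and the batch size are stand-ins, not the actual challenge data:

```r
# Sketch only: `full_data` and the batch size are placeholders.
batch_size <- 50000
batch_id   <- ceiling(seq_len(nrow(full_data)) / batch_size)

scored_batches <- lapply(split(full_data, batch_id), function(batch) {
  # run the scoring / Append Cluster step against this slice only
  batch
})
scored <- do.call(rbind, scored_batches)
```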
I was unable to replicate the error. I created a dummy dataset and tried to hit a memory limit, but even with Designer's memory limit set to 400 MB, I couldn't trigger one. I think the page file was being used instead of the underlying R code hitting a hard limit.
I have a feeling there was a mismatch between what was fed to the Select tool, and therefore to the K-Centroids Cluster tool, which caused an error once the original data was scored. Another possibility is that only a fraction of the initial dataset was used to train the clustering (hence the Select tool), and the resulting dataset was simply too big to score in one pass. If it's the latter, I'd suggest saving the cluster model object in one workflow and then scoring the data in a separate batch macro workflow, which would require less data to be loaded at a time.
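A rough sketch of what that two-workflow split could look like in the underlying R; the object names, sample size, and number of centers here are illustrative assumptions, not the challenge's actual configuration:

```r
# Workflow 1 (sketch): train the clusters on a manageable sample only.
# `full_data` is a placeholder and is assumed to contain only numeric fields.
train_sample <- full_data[sample(nrow(full_data), 100000), ]
model <- kmeans(train_sample, centers = 5, nstart = 10)
saveRDS(model, "kmeans_model.rds")  # persist the cluster model object

# Workflow 2 (sketch, e.g. inside a batch macro): load the model and score
# one batch at a time by assigning each row to its nearest centroid.
model <- readRDS("kmeans_model.rds")
nearest_centroid <- function(batch, centers) {
  x <- as.matrix(batch)
  # squared Euclidean distance from every row of the batch to every centroid
  d <- outer(rowSums(x^2), rowSums(centers^2), "+") - 2 * x %*% t(centers)
  max.col(-d)  # column index of the smallest distance = assigned cluster
}

batch <- full_data[1:50000, ]  # stand-in for one macro iteration's slice of records
batch$cluster <- nearest_centroid(batch, model$centers)
```

Because only the model object and one slice of records are in memory at any time, each iteration stays well under the limit that the full dataset would hit.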