The solution to last week's Challenge has been posted here!
We are thrilled to present another Challenge from our “Think like a CSE” series, brought to you by our fearless team of Customer Support Engineers. Each month, the Customer Support team will ask Community members to “think like a CSE” to try to resolve a case that was inspired by real-life issues encountered by Alteryx users like you! This month we present the case of the R Error Message: cannot allocate vector of size 7531.1 Gb.
Below, we’ve provided the information that was initially available to the Customer Support Engineer who resolved the case. It’s up to you to use this information to put a solution together for yourself.
The Case: A co-worker is running into the R error “Append Cluster: Error: cannot allocate vector of size 7531.1 Gb” in a workflow that uses a couple of the Predictive Clustering Tools. This error is causing the workflow to stop running before completing. A screenshot of the workflow and the error(s) is shown below.
Your Goal: Identify the root cause of the issue, and develop a solution or workaround to help your co-worker get past this error and finish running the workflow.
Asset Description: Your co-worker can’t share the file with you due to client privacy concerns, but it is about 7 GB in size (3 million records and 30 fields). Using only the provided screenshots, dummy data of your own design, and sheer willpower, can you develop a possible resolution?
The Solution:
Thank you for participating in this week’s think like a CSE challenge!
Like many of you, when taking on this case I first asked for the data set so I could attempt to reproduce the error on my own machine. This is often a first step, particularly for workflows that use predictive tools, since errors with those tools are frequently caused by the data itself. However, as noted in the asset description, the user was unable to share the data with me due to privacy concerns. This is not uncommon in Support: users are often unable to share their data or workflows due to privacy or other concerns.
To work around this, we will either generate or find a dummy data set to attempt to reproduce the error and explore workarounds.
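For anyone who wants to follow along, here is one way a stand-in data set of roughly the right shape (3 million records, 30 fields) could be generated in R. The field names and distributions below are invented purely for illustration; they are not the user's actual data.

```r
# Rough sketch only: build a stand-in data set shaped like the one described
# (3 million records, 30 fields). Column names and distributions are invented.
set.seed(42)

n_rows   <- 3e6
n_fields <- 30

dummy <- as.data.frame(
  matrix(rnorm(n_rows * n_fields), nrow = n_rows, ncol = n_fields)
)
names(dummy) <- paste0("Field_", seq_len(n_fields))

# Writing 3 million rows takes a while; the resulting CSV can then be pulled
# into an Alteryx workflow with an Input Data tool.
write.csv(dummy, "dummy_cluster_data.csv", row.names = FALSE)
```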
Many of you correctly identified that the core cause of this error is memory. This is something you can find with a quick internet search of the error message. You can read more about R memory limitations here.
The error is stating that the code is unable to allocate a vector of 7531 GB, which means it is trying to hold that much data in the machine’s memory at once. With this in mind, I did not feel a RAM upgrade would resolve this issue, as 7531 GB of RAM would be difficult to find and prohibitively expensive.
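As a rough sanity check on that conclusion, the arithmetic below (assuming double-precision values at 8 bytes each, which is what R uses for numeric vectors) shows just how far the requested allocation is from the size of the input data itself.

```r
# Back-of-the-envelope arithmetic only; the exact object the tool's R code
# was trying to build is not visible from the error message.
bytes_per_double <- 8

# How many double-precision values would a 7531.1 Gb vector hold?
requested_bytes <- 7531.1 * 1024^3
requested_bytes / bytes_per_double
#> roughly 1e12 -- about a trillion values

# For comparison: 3 million records x 30 fields stored as doubles
(3e6 * 30 * bytes_per_double) / 1024^3
#> roughly 0.67 Gb -- so the failing allocation is an enormous intermediate
#> object created during processing, not the raw data itself
```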
What is interesting about the error in this workflow is that it is coming from the Append Cluster Tool, which is effectively a Score Tool for the K-Centroids Cluster Analysis Tool. This suggests that the clustering solution itself is being built successfully by the K-Centroids Cluster Analysis Tool, and the workflow is running into the error while trying to assign cluster labels to the original data based on the clustering solution. Because the Append Cluster Tool is essentially a Score Tool, there is no need to try to run all the data through it at once. This gives us an opportunity for a workaround.
I suggested the user try to divide up the data set and run it through the Append Cluster Tool in multiple batches after creating the clustering solution with the K-Centroids Cluster Analysis Tool. This could be done by splitting the records manually, but the most efficient way is to build a batch macro around the Append Cluster Tool. There is a great Community article on this approach called Splitting Records into Smaller Chunks to make a Workflow Process Quicker. The user confirmed that this method resolved the error on their machine. Case Closed!
It is important to remember that this strategy to work around the error is an option because the model itself is being built without issue, and the Append Cluster Tool’s results will not change if you run different groups of data through the tool at different times. Many of the predictive tools do require that the data all be provided at once to build a model, but the Score Tool and the Append Cluster Tool are simply applying a model to data to estimate the target values of the records.
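To make that reasoning concrete, here is a rough sketch of the same batching idea in plain R, using base R's kmeans() as a stand-in for the K-Centroids Cluster Analysis/Append Cluster pair. The function, batch size, and data below are illustrative assumptions, not the tools' actual internal code.

```r
# Conceptual sketch of the batching idea in plain R; the Alteryx batch macro
# does the equivalent with the Append Cluster Tool. A small random data frame
# stands in for the real 3-million-row input.
set.seed(42)
dat <- as.data.frame(matrix(rnorm(100000 * 30), ncol = 30))

# 1. Build the clustering solution once (analogous to K-Centroids Cluster Analysis).
model <- kmeans(dat, centers = 5)

# 2. Assign each record to its nearest centroid, one batch at a time
#    (analogous to running the Append Cluster Tool inside a batch macro).
assign_clusters <- function(batch, centers) {
  batch <- as.matrix(batch)
  # squared Euclidean distance from every record to every centroid
  d <- sapply(seq_len(nrow(centers)), function(k) {
    rowSums(sweep(batch, 2, centers[k, ])^2)
  })
  max.col(-d)  # column index of the nearest centroid for each record
}

batch_size <- 10000
starts <- seq(1, nrow(dat), by = batch_size)

labels <- unlist(lapply(starts, function(s) {
  rows <- s:min(s + batch_size - 1, nrow(dat))
  assign_clusters(dat[rows, ], model$centers)
}))

# Because every batch is scored against the same centroids, the combined
# labels are identical to what a single pass over all the data would give.
```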
Thank you again for participating! I hope that this has been an informative challenge and that you had as much fun working through it as I did!
I was unable to replicate this error, so I'm not sure if these suggestions will help (or if this response will count as a completed challenge).
I've never used the cluster tools before, but....
Having trouble recreating a data set here (dummy data would always help), so just going for some suggestions instead of attaching a workflow
Here is my attempt at a root cause and possible solutions...
I am not very experienced with this tool, so I used the R forums to research the problem. I'm not 100% sure of my answer either, so I'm looking forward to the solution being posted!
I've used these tools before so I have a few ideas. It seems that the size of the data set with 3 million rows and 30 variables is the culprit!
I'm attacking this from a sys admin point of view, since most of the Alteryx/R related points having to do with data set size, process optimisation and memory management were covered previously.
Dan