Cluster Analysis - Customer R tool, Errors out


I am using the R tool for a hierarchical cluster analysis with Ward clustering. The tool runs without error only when a sample of the dataset is passed through (the largest sample I have tested successfully is 12k records by 70 columns). Ultimately, I want to run 1M records by 70 columns through it.


I have saved the R script error logs for both the full dataset (1M records) and the sample (120k records); you will notice that the errors differ between the two. Please advise.


1. Starting with the R tool (R 140), the full dataset returns the errors "cannot allocate vector size of 5190.1GB", "execution halted", and then "R.exe exit code (4294967295) indicated an error". The R tool also does not create any outputs.
error log_11_06_2018.PNG


2. When only a sample is run, again no output is created, and the error "Error in hclust(d,method='ward.D2)" is produced.

error log_11_06_2018_Sample.PNG


Hi @grftjw,


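The "cannot allocate vector" error is what R gives when it cannot get the memory it asks for. As you can see, it's trying to allocate a single 5190.1 GB block, which is roughly 5 TB and far more RAM than a workstation has. This is not unexpected for hierarchical clustering on a large dataset: hclust works from a pairwise distance matrix, and that matrix grows with the square of the number of records. R can also be somewhat memory inefficient in general.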
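To put rough numbers on that, here is a back-of-the-envelope sketch. It assumes the script builds its distances with dist() before calling hclust (the `d` in your second error suggests so, but I have not seen the script). dist() keeps one 8-byte value per pair of records, so the size depends on the record count only, not on the 70 columns. It is written in Python purely for the arithmetic, and `dist_matrix_gib` is a made-up helper name.

```python
def dist_matrix_gib(n_records: int, bytes_per_value: int = 8) -> float:
    """Approximate size, in GiB, of a lower-triangle pairwise distance matrix.

    Assumption: one double (8 bytes) is stored for each of the
    n * (n - 1) / 2 record pairs, which is how R's dist() stores its result.
    """
    n_pairs = n_records * (n_records - 1) // 2
    return n_pairs * bytes_per_value / 2**30


# Record counts mentioned in the question: 12k, 120k and 1M.
for n in (12_000, 120_000, 1_000_000):
    print(f"{n:>9,} records -> {dist_matrix_gib(n):10.1f} GiB")
```

Under these assumptions the 1M-record case lands in the same multi-terabyte range as the figure in your log, and multiplying the record count by ten multiplies the memory by about a hundred.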


Potential solutions:


  • Increase RAM on your workstation. (Not a convenient option, but not a joke: more RAM helps).
  • Find a comfortable data size, and run through only that much, in chunks.
    • This assumes chunking is possible for the kind of analysis you're doing. It may not work in your case, which could be what your 2nd error is showing.
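One way chunking is sometimes made to work for clustering, sketched here as an assumption rather than something the R tool does for you: cluster a sample that fits in memory, compute the mean of each resulting cluster, then stream the remaining records through in chunks and give each one the label of the nearest mean. This is not the same as running Ward clustering on all records, so treat the labels as an approximation. The function names and the toy data are hypothetical, and the sample labels are assumed to come from whatever clustering run succeeded on the sample.

```python
from itertools import islice
from math import dist  # Euclidean distance, Python 3.8+


def centroids_from_sample(sample, labels):
    """Mean vector per cluster label, from an already clustered sample."""
    sums, counts = {}, {}
    for row, label in zip(sample, labels):
        acc = sums.setdefault(label, [0.0] * len(row))
        for i, value in enumerate(row):
            acc[i] += value
        counts[label] = counts.get(label, 0) + 1
    return {label: [s / counts[label] for s in acc] for label, acc in sums.items()}


def assign_in_chunks(rows, centroids, chunk_size=10_000):
    """Yield the nearest-centroid label for each row, reading chunk_size rows at a time."""
    iterator = iter(rows)
    while True:
        chunk = list(islice(iterator, chunk_size))
        if not chunk:
            break
        for row in chunk:
            yield min(centroids, key=lambda label: dist(row, centroids[label]))


# Toy data standing in for a clustered sample (labels 1 and 2 are made up).
sample = [(0.0, 0.0), (0.0, 1.0), (10.0, 10.0), (10.0, 11.0)]
sample_labels = [1, 1, 2, 2]
centroids = centroids_from_sample(sample, sample_labels)

remaining = [(1.0, 1.0), (9.0, 9.0), (0.5, 0.2)]
print(list(assign_in_chunks(remaining, centroids, chunk_size=2)))
```

Memory use here is bounded by the chunk size and the number of clusters, not by the total record count, which is the point of the exercise.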

Further out in left field: can your data be made smaller in any way, for example by assigning factors? I doubt this will help, but it's a thought.



Given your large number of fields, you may want to explore using the Principal Components tool to condense some of them if possible. This tool is especially useful if you have related fields, e.g. 12 copies of the same field where each copy holds the value for a specific month.
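To show the idea behind that suggestion, here is a minimal sketch of extracting the first principal component from a few related columns. It is not how the Principal Components tool is implemented; it uses plain power iteration on the covariance matrix, and the function name and the toy monthly values are invented for illustration. Fewer columns make the input table smaller and each distance cheaper to compute, but they do not change the number of record pairs.

```python
import math


def first_principal_component(rows, iterations=200):
    """Return (direction, scores) for the first principal component.

    Uses power iteration on the sample covariance matrix, which is fine
    for a handful of columns. The start vector is all ones, so this sketch
    would fail if that vector were exactly orthogonal to the component.
    """
    n, p = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(p)]
    centered = [[r[j] - means[j] for j in range(p)] for r in rows]
    cov = [[sum(c[i] * c[j] for c in centered) / (n - 1) for j in range(p)]
           for i in range(p)]

    v = [1.0] * p
    for _ in range(iterations):
        w = [sum(cov[i][j] * v[j] for j in range(p)) for i in range(p)]
        norm = math.sqrt(sum(x * x for x in w))
        if norm == 0:  # all columns constant: nothing to extract
            break
        v = [x / norm for x in w]

    # One score per record replaces the p original columns.
    scores = [sum(c[j] * v[j] for j in range(p)) for c in centered]
    return v, scores


# Three made-up "monthly" columns that move together.
monthly = [(10.0, 11.0, 9.5), (20.0, 21.5, 19.0), (30.0, 29.0, 31.0), (40.0, 41.0, 39.5)]
direction, scores = first_principal_component(monthly)
print(direction)
print(scores)
```

If one component captures most of the variation in a group of related fields, its scores can stand in for the whole group in the clustering input.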