I am using the R tool for a hierarchical cluster analysis (Ward clustering). The tool runs without error only when a sample of the dataset is passed in (the maximum I have successfully tested is 12k records by 70 columns/fields). Ultimately, I am aiming to run 1M records by 70 columns through it.
I have saved the error logs from the R script for both the full dataset (1M records) and the sample (120k records); you will notice the errors differ between the two. Please advise.
1. Beginning with the R tool (R 140), the full dataset returns the errors "cannot allocate vector of size 5190.1 GB" and "Execution halted", then "R.exe exit code (4294967295) indicated an error". Further, the R tool does not create any outputs.
2. When only a sample is run, again no output is created, and the error "Error in hclust(d, method='ward.D2')" is produced.
The "cannot allocate vector" error is what R reports when it runs out of memory; as you can see, it is trying to allocate roughly 5,190 GB (over 5 TB) of RAM. This is not unexpected for large datasets: hierarchical clustering needs the full pairwise distance matrix, and R can be somewhat memory-inefficient on top of that.
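As a rough sanity check (assuming `hclust()` is being fed an ordinary `dist()` matrix), the memory for the pairwise distances alone can be estimated like this; the exact figure in your log is somewhat larger, presumably due to extra working copies, but the order of magnitude matches:

```r
# Back-of-envelope estimate: dist() stores n*(n-1)/2 pairwise distances,
# each an 8-byte double, before hclust() even starts.
n     <- 1e6                      # 1M records
bytes <- n * (n - 1) / 2 * 8      # lower triangle of the distance matrix
gib   <- bytes / 1024^3
gib                               # roughly 3725 GiB, i.e. several terabytes
```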
Increase RAM on your workstation. (Not a convenient option, but not a joke: more RAM helps).
Find a comfortable data size, and run through only that much, in chunks.
This assumes chunking is possible with whatever kind of analysis you're doing... it may not be for your case, which could also be behind your 2nd error.
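If sampling is acceptable, a minimal sketch (with made-up data standing in for yours) is to cluster only a subset that fits in memory; the remaining rows could then be assigned to clusters by nearest centroid, since `hclust()` itself cannot score new rows:

```r
# Hypothetical sketch: cluster only a random sample that fits in RAM.
set.seed(42)
df  <- as.data.frame(matrix(rnorm(2000 * 5), ncol = 5))  # stand-in data
idx <- sample(nrow(df), 500)        # a sample size your machine tolerates
d   <- dist(df[idx, ])              # distances computed on the sample only
hc  <- hclust(d, method = "ward.D2")
groups <- cutree(hc, k = 4)         # cut the dendrogram into 4 clusters
table(groups)                       # cluster sizes within the sample
```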
More out in left field: can your data be made smaller in any way? Converting text columns to factors, for example? Doubtful this will help enough on its own, but it's a thought.
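For example, a factor stores one integer code per row plus a single copy of each distinct level, so repetitive character columns shrink noticeably when converted:

```r
# Repetitive text column: as character vs. as factor.
x_chr <- rep(c("January", "February", "March"), length.out = 1e5)
x_fac <- factor(x_chr)
object.size(x_chr)   # character vector: one pointer per element
object.size(x_fac)   # factor: integer codes plus 3 stored levels
```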
Given your large number of fields, you may want to explore the Principal Components tool to condense some of them if possible. This tool is especially useful when you have related fields - e.g. 12 copies of the same field where each one holds the value for a specific month.
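In base R terms (the Principal Components tool wraps the same idea), `prcomp()` can replace a dozen correlated monthly columns with a couple of components; the monthly data below is invented purely for illustration:

```r
# Sketch: 12 correlated "monthly" columns collapsed via PCA.
set.seed(1)
base    <- rnorm(1000)                                    # shared signal
monthly <- sapply(1:12, function(m) base + rnorm(1000, sd = 0.1))
pca     <- prcomp(monthly, center = TRUE, scale. = TRUE)
summary(pca)             # first component captures most of the variance
reduced <- pca$x[, 1:2]  # keep 2 component columns instead of 12 fields
```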