Weekly Challenge
Do you have the skills to make it to the top? Subscribe to our weekly challenges. Try your best to solve the problem, share your solution, and see how others tackled the same problem. We share our answer too.

Challenge #131: Think Like a CSE... The R Error Message: cannot allocate vector of size...

 

Spoiler

Disclaimer: I have never used these predictive tools; this is my first attempt at a weekly challenge.

I was able to replicate the error. The issue is that the data is not brought in from the same point (the Select tool). I assume the user was using a Select tool to change the data type, so two different data types were coming into the Append Cluster tool.

 

 

Spoiler
predictive.PNG
Spoiler
Solution.PNG

 

Quasar

Never used the tool before...

Spoiler
It looks like the dataset is too big for the tool. Here are a few suggestions on how to fix this. You could insert a Select tool before the Append Cluster tool (on the CSV input) and optimise the field types. You could also try appending the clusters to the data in batches.

I'd recommend only sending a sample of the data in first. If this resolves the problem, we'll know it's due to the size and can work from there.

If you don't come right, could you send screenshots of the tool configurations? Perhaps we can optimise those a little as well.
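The sample-first diagnostic is easy to sketch outside Alteryx. A minimal Python sketch, with the function name, sample size, and row layout assumed for illustration: pull a small reproducible sample and run the clustering on that first; if the sample succeeds where the full set fails, size is the likely culprit.

```python
import random

def diagnostic_sample(rows, k=1000, seed=42):
    """Return a reproducible random sample of at most k rows.

    If clustering the sample works but the full set fails, the
    failure is almost certainly a memory/size issue.
    """
    rng = random.Random(seed)  # fixed seed so the test is repeatable
    if len(rows) <= k:
        return list(rows)
    return rng.sample(rows, k)

rows = [(i, i * 0.5) for i in range(100_000)]
sample = diagnostic_sample(rows, k=1_000)
```

The fixed seed matters for the "work from there" part: a reproducible sample means any error you do hit can be reproduced on the same rows.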
Alteryx Certified Partner
Spoiler
I also struggled to replicate this error. Even with bumping up the number of records to over 13 billion, my laptop seemed content to chug away at the clustering indefinitely. On the plus side, I've learnt how to create my first unsupervised classification model in Alteryx.

Challenge 131 - Capture.PNG
13 billion records, but no error.

My solution... sort of. :)

 

Spoiler
Having never used the Cluster tools before, I will admit to having had to check out some of the suggestions/solutions that were already posted before "solving" this one... but what I can provide is a workflow + macro that will batch the results to achieve the type of solution suggested by several others! 

Basically used the Tile tool to create some batches, then summarized those to use as the Control Parameter for a batch macro, ran each batch through the Append Cluster Tool, then output the results to the workflow. Wasn't able to really test this, but I did add some notes so that the person implementing this revised process in their workflow would hopefully be able to do so easily with minimal configuration.

WeeklyChallenge131.JPG
WeeklyChallenge131Macro.JPG

(Would normally have used an iterative macro because I like them better, and someone once said "There's nothing you can do with a batch macro that you can't also do with an iterative macro"... but my coworker regularly challenges me to use Batch Macros for simpler repetitive scenarios, so since I was already sort of cheating on this one, I figured I would at least choose the macro type that I don't normally gravitate towards...)

Cheers,

NJ

Meteor
Spoiler
In my opinion, the issue can be resolved in a number of ways, the best being to divide the dataset and use a batch macro to append clusters iteratively. Once the cluster analysis is done on the whole dataset, we do not need to process all source rows at once to append clusters; it can be done in batches and would not affect the output.
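The reason batching doesn't affect the output is that appending a cluster is a per-row operation: each record is scored against the same fixed centroids, independently of every other record. A toy nearest-centroid sketch in Python (purely illustrative, not Alteryx's internal implementation) shows the batched and unbatched results are identical:

```python
def assign_cluster(point, centroids):
    """Index of the nearest centroid by squared Euclidean distance."""
    return min(range(len(centroids)),
               key=lambda c: sum((p - q) ** 2
                                 for p, q in zip(point, centroids[c])))

def append_clusters(points, centroids, batch_size=None):
    """Score all points, either in one pass or in fixed-size batches."""
    if batch_size is None:
        return [assign_cluster(p, centroids) for p in points]
    labels = []
    for start in range(0, len(points), batch_size):
        batch = points[start:start + batch_size]  # only this slice in memory
        labels.extend(assign_cluster(p, centroids) for p in batch)
    return labels

centroids = [(0.0, 0.0), (10.0, 10.0)]
points = [(0.5, 0.2), (9.5, 10.1), (1.0, -1.0), (11.0, 9.0)]
assert append_clusters(points, centroids) == \
       append_clusters(points, centroids, batch_size=2)
```

Because no row's label depends on any other row, the batch size is a pure memory knob.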
Aurora

So - always keen on hearing my computer beg for mercy under the load of a multi-GB recordset ....

 

Spoiler
... I searched the inter-tubes and found that this specific error message is (as suspected) due to memory constraints.
So - created a test set to torture my machine and see what happens....


2018-10-29_21-58-58.jpg

This test data set creates a few random numbers and a few null values - with a record-set the same size as the client.

Result?    Not much, except for an exciting power bill.

So not easy to replicate on my machine - but given that the problem is happening in the "Append Cluster" rather than the "K Centroids Cluster Analysis" - my vote would be to:
a) split the analysis vs. the scoring stages into 2 separate pieces (that creates a nice gap to purge working memory)
b) if the scoring stage continues to run out of memory then break this into batches.
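Step (a) works because the two stages have very different memory profiles: the analysis stage only has to produce a tiny centroid table, and that table is all the scoring stage needs. A rough Python sketch of the analysis stage under that assumption (plain Lloyd's k-means with naive initialisation, illustrative only):

```python
def kmeans_fit(points, k=2, iters=10):
    """Stage 1 only: learn k centroids with plain Lloyd iterations.

    The output is tiny (k rows), so it can be persisted and handed to
    a separate scoring stage without keeping the training data around.
    """
    dims = len(points[0])
    centroids = [list(p) for p in points[:k]]  # naive init: first k points
    for _ in range(iters):
        sums = [[0.0] * dims for _ in range(k)]
        counts = [0] * k
        for p in points:
            # assign each point to its nearest current centroid
            c = min(range(k),
                    key=lambda i: sum((a - b) ** 2
                                      for a, b in zip(p, centroids[i])))
            counts[c] += 1
            for d, v in enumerate(p):
                sums[c][d] += v
        for i in range(k):
            if counts[i]:  # recompute centroid as the cluster mean
                centroids[i] = [s / counts[i] for s in sums[i]]
    return centroids

centroids = kmeans_fit([(0.0, 0.0), (0.0, 1.0), (10.0, 10.0), (10.0, 11.0)], k=2)
```

Once the centroids are written out, stage 2 can stream the full recordset through scoring in whatever batch size memory allows, which is exactly the fallback in (b).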

2018-10-29_22-04-45.jpg2018-10-29_22-06-43.jpg

Asteroid

Cheers! This was a fun one! I learned a lot about the predictive grouping tools.

 

Spoiler
So, off the bat, I made an assumption based on the screenshot from the user. Based on the name of the file in the picture, I'm assuming the initial data set has something to do with housing and that the user is trying to cluster by housing locations and maybe another variable (the K-Centroid Cluster tool can cluster based on a large number of dimensions, but living in the world we do, we tend to think in either two or three dimensions). So, basically, I tried to find a model data set that would be similar to my assumption: mainly, I needed lots of fields, with one of them being a location (lat and long coordinates). After a little searching, I was able to find Los Angeles crime data since 2010 from data.gov (https://catalog.data.gov/dataset/crime-data-from-2010-to-present). While not as large as the problem data set (569 MB, 1.8 million rows, and 26 fields), it's still pretty large and difficult to work with.

Here is the workflow that I came up with. 
Challenge_131_1.PNG
My workflow
I initially attempted to push all 1.8 million records through the K-Centroid Cluster tool clustering off the lat and long coordinates. After about 5 minutes of waiting, I grew impatient and cancelled the workflow. Because a lot of the data happened at the same lat and long coordinates, I decided to group my coordinates and count the number of records for each group. I then added the count as a third variable to cluster. This reduced the number of records to process without compromising the integrity of the end result. 
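The group-and-count reduction can be sketched in a few lines of Python (field names are assumed for illustration): collapse the records to unique lat/long pairs with a count column, cluster the much smaller table, then join the cluster numbers back on the coordinates.

```python
from collections import Counter

def reduce_by_location(records):
    """Collapse records to unique (lat, lon) pairs plus a count field.

    The count can then be clustered as a third variable, and the
    result joined back to the original records on (lat, lon).
    """
    counts = Counter((r["lat"], r["lon"]) for r in records)
    return [{"lat": lat, "lon": lon, "count": n}
            for (lat, lon), n in counts.items()]

records = ([{"lat": 34.05, "lon": -118.24}] * 3
           + [{"lat": 34.10, "lon": -118.30}])
reduced = reduce_by_location(records)  # 4 input rows collapse to 2
```

Because every original row still carries its (lat, lon) key, the join back is lossless, which is why the integrity of the end result is preserved.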

Challenge_131_3.PNG
I was then able to join my original data set using the lat and long coordinates and bring the cluster number into my original data set. I ended up throwing this into Tableau to see how my clusters came out, and this is what I came up with. 
Challenge_131_2.PNG

 

Asteroid
Spoiler

The "cannot allocate vector" error is what R gives when it runs out of memory. The workflow is trying to hold 7531 GB of data in the machine's memory at once. I would suggest the user divide the data into batches using a batch macro.
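For intuition on how fast these numbers grow: R stores numeric values as 8-byte doubles, so a dense block of data needs roughly rows × columns × 8 bytes. A back-of-envelope helper (illustrative arithmetic only; the 7531 GB figure is the one reported above):

```python
def r_numeric_gb(rows, cols, bytes_per_value=8):
    """Approximate memory for a dense numeric block in R.

    R numeric vectors hold 8-byte doubles, so the footprint is
    roughly rows * cols * 8 bytes, converted here to GB.
    """
    return rows * cols * bytes_per_value / 1024 ** 3

# A million rows by 30 numeric fields is still well under 1 GB...
small = r_numeric_gb(1_000_000, 30)

# ...so a single 7531 GB allocation implies on the order of a
# trillion double values in one object.
values_in_7531_gb = 7531 * 1024 ** 3 // 8
```

That gap between "comfortably under 1 GB" and "a trillion values" is why batching the scoring, rather than buying more RAM, is the practical fix.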

Asteroid
Spoiler
Not sure of the exact problem, but it seems there is a Select object pointing to the 2nd cluster object but not the model...
    - based on my tests, this can cause a problem if the fields are not of the same type.

The other thing I would do is ask for a really small sample with a few rows with obfuscated data...

The objects seem to be able to handle up to 1000000 records (when I tested it) and 30 columns of fake data...
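The field-type hypothesis above is easy to check programmatically as well. A hypothetical Python sketch comparing two field-name → type mappings (the type names below are Alteryx-style examples, not values read from the actual workflow):

```python
def schema_mismatches(model_fields, data_fields):
    """Report fields present in both schemas whose types differ.

    This is the kind of mismatch a Select tool edit on only one
    input branch can introduce before an Append Cluster step.
    """
    return {name: (t, data_fields[name])
            for name, t in model_fields.items()
            if name in data_fields and data_fields[name] != t}

# Hypothetical schemas: 'rooms' was retyped on one branch only
model = {"price": "Double", "rooms": "Int32", "zip": "V_String"}
data = {"price": "Double", "rooms": "V_String", "zip": "V_String"}
bad = schema_mismatches(model, data)  # flags the 'rooms' field
```

An empty result means the two branches agree; anything flagged is a field to fix in the Select tool before appending clusters.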