
Weekly Challenges

Solve the challenge, share your solution and summit the ranks of our Community!


Challenge #131: Think Like a CSE... The R Error Message: cannot allocate vector of size...

JonathanRichey
7 - Meteor

 

Spoiler

Disclaimer… I have never used these predictive tools, and this is my first attempt at a weekly challenge.

I was able to replicate the error. The issue is that the data is not being brought in from the same point (the Select tool). I am assuming the user was using the Select tool to change a data type, so two different data types were coming into the Append Cluster tool.

 

 

Spoiler
[Image: predictive.PNG]
Spoiler
[Image: Solution.PNG]

 

kat
12 - Quasar

Never used the tool before...

Spoiler
It looks like the dataset is too big for the tool. Here are a few suggestions on how to fix this. You could insert a Select tool before the Append Cluster tool (on the CSV input) and optimise the field types. You could also try to append the clusters to the data in batches.

I'd recommend only sending a sample of the data in first. If this resolves the problem, we'll know that it's due to the size and can work from there.

If that doesn't sort it out, could you send screenshots of the tool configurations? Perhaps we can optimise those a little as well.
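
To put a rough number on the field-type point, here's a tiny R sketch with made-up data (nothing from the actual challenge file): the same million values cost far more memory held as text than as doubles, which is what fixing the types in a Select tool avoids.

# Illustrative only: numbers stored as strings vs. as doubles
as_text   <- format(rnorm(1e6), digits = 15)   # numeric values held as text
as_double <- as.numeric(as_text)

format(object.size(as_text),   units = "MB")   # roughly an order of magnitude larger
format(object.size(as_double), units = "MB")   # 8 bytes per value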
PhilipMannering
16 - Nebula
Spoiler
I also struggled to replicate this error. Even after bumping the number of records up to over 13 billion, my laptop seemed content to chug away at the clustering indefinitely. On the plus side, I've learnt how to create my first unsupervised classification model in Alteryx.

[Image: 13 billion records, but no error]
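
If the goal is only to see the message rather than to reach it through the clustering tools, a single oversized request reproduces it straight away; a throwaway R sketch (the exact figure in the message depends on the size asked for):

# Don't run this anywhere you care about: asking R for one vector far
# larger than the available RAM fails immediately.
x <- numeric(2e12)
# Error: cannot allocate vector of size 14901.2 Gb   (or similar)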
NicoleJohnson
ACE Emeritus

My solution... sort of. :)

 

Spoiler
Having never used the Cluster tools before, I will admit to having had to check out some of the suggestions/solutions that were already posted before "solving" this one... but what I can provide is a workflow + macro that will batch the results to achieve the type of solution suggested by several others! 

Basically used the Tile tool to create some batches, then summarized those to use as the Control Parameter for a batch macro, ran each batch through the Append Cluster Tool, then output the results to the workflow. Wasn't able to really test this, but I did add some notes so that the person implementing this revised process in their workflow would hopefully be able to do so easily with minimal configuration.

[Images: WeeklyChallenge131.JPG, WeeklyChallenge131Macro.JPG]

(Would normally have used an iterative macro because I like them better, and someone once said "There's nothing you can do with a batch macro that you can't also do with an iterative macro"... but my coworker regularly challenges me to use Batch Macros for simpler repetitive scenarios, so since I was already sort of cheating on this one, I figured I would at least choose the macro type that I don't normally gravitate towards...)

Cheers,

NJ

sh0kat
7 - Meteor
Spoiler
In my opinion, the issue can be resolved in a number of ways, the best one being to divide the dataset and use a batch macro to append clusters iteratively. Once the Cluster Analysis has been done on the whole dataset, we do not need to process all source rows at once to append clusters. It can be done in batches without affecting the output.
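
A minimal R analogue of that idea, with synthetic data since the real file isn't available (the nearest-centroid step below is a simplified stand-in for the Append Cluster tool): the centroids are fitted once, and scoring the rows chunk by chunk returns exactly the same labels as scoring them all at once.

set.seed(131)
dat   <- matrix(rnorm(2e5), ncol = 2)      # stand-in for the source rows
model <- kmeans(dat, centers = 5)          # the Cluster Analysis stage

# Row-wise nearest-centroid assignment (simplified Append Cluster step)
assign_clusters <- function(x, centres) {
  d2 <- outer(rowSums(x^2), rowSums(centres^2), "+") - 2 * x %*% t(centres)
  max.col(-d2)
}

all_at_once <- assign_clusters(dat, model$centers)

batches    <- split(seq_len(nrow(dat)), ceiling(seq_len(nrow(dat)) / 10000))
in_batches <- unlist(lapply(batches, function(i)
  assign_clusters(dat[i, , drop = FALSE], model$centers)))

all(all_at_once == in_batches)   # TRUE: batching does not change the output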
SeanAdams
17 - Castor

So - always keen on hearing my computer beg for mercy under the load of a multi-GB recordset ....

 

Spoiler
... I searched the inter-tubes and found that this specific error message is (as suspected) due to memory constraints.
So - created a test set to torture my machine and see what happens....


[Image: 2018-10-29_21-58-58.jpg]

This test data set creates a few random numbers and a few null values, with a record-set the same size as the client's.

Result?    Not much, except for an exciting power bill.

So not easy to replicate on my machine - but given that the problem is happening in the "Append Cluster" tool rather than the "K Centroids Cluster Analysis" tool - my vote would be to:
a) split the analysis vs. the scoring stages into 2 separate pieces (that creates a nice gap to purge working memory)
b) if the scoring stage continues to run out of memory, then break this into batches.

[Images: 2018-10-29_22-04-45.jpg, 2018-10-29_22-06-43.jpg]
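
A short R sketch of option (a), again with stand-in data: keep only the fitted centres from the analysis stage and release the training data before the scoring stage starts, so the two stages never sit in memory together.

train   <- matrix(rnorm(4e6), ncol = 4)        # stand-in for the client recordset
centres <- kmeans(train, centers = 4)$centers  # analysis stage: keep only the centres

rm(train)   # drop the raw training data...
gc()        # ...and hand the memory back before scoring begins

# Scoring stage: only the small `centres` matrix is needed from here on,
# and per (b) the rows to be scored can themselves be fed through in batches.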

JoBen
11 - Bolide

Cheers! This was a fun one! I learned a lot about the predictive grouping tools.

 

Spoiler
So, off the bat, I made an assumption based on the screenshot from the user. Based on the name of the file in the picture, I'm assuming the initial data set has something to do with housing and that the user is trying to cluster by housing locations and maybe another variable (the K-Centroids Cluster tool can cluster based on a large number of dimensions, but living in the world we do, we tend to think in either two or three dimensions). So, basically, I tried to find a model data set that would fit my assumption: mainly, I needed lots of fields, with one of them being a location (lat and long coordinates). After a little searching, I was able to find Los Angeles crime data since 2010 from data.gov (https://catalog.data.gov/dataset/crime-data-from-2010-to-present). While not as large as the problem data set, at 569 MB, 1.8 million rows, and 26 fields it's still pretty large and difficult to work with.

Here is the workflow that I came up with. 
[Image: My workflow]
I initially attempted to push all 1.8 million records through the K-Centroid Cluster tool, clustering off the lat and long coordinates. After about 5 minutes of waiting, I grew impatient and cancelled the workflow. Because a lot of the records fell on the same lat and long coordinates, I decided to group my coordinates and count the number of records for each group. I then added the count as a third variable to cluster on. This reduced the number of records to process without compromising the integrity of the end result.

[Image: Challenge_131_3.PNG]
I was then able to join back to my original data set using the lat and long coordinates and bring the cluster number into it. I ended up throwing this into Tableau to see how my clusters came out, and this is what I came up with.
[Image: Challenge_131_2.PNG]
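
The same aggregate-then-join idea as a quick R sketch; the column names, coordinate precision and number of clusters are made up, and the real workflow does this with the Alteryx tools:

set.seed(131)
crimes <- data.frame(LAT = round(runif(5e4, 33.7, 34.3), 2),
                     LON = round(runif(5e4, -118.7, -118.1), 2))

# Collapse rows that share coordinates and count how many land on each point
agg <- aggregate(cnt ~ LAT + LON, data = transform(crimes, cnt = 1), FUN = sum)

# Cluster the much smaller aggregated table on LAT, LON and the count
agg$cluster <- kmeans(scale(agg[, c("LAT", "LON", "cnt")]), centers = 6)$cluster

# Join the cluster label back onto every original record by coordinates
crimes <- merge(crimes, agg[, c("LAT", "LON", "cluster")], by = c("LAT", "LON"))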

 

JosephSerpis
17 - Castor
Spoiler

The "cannot allocate vector" error is what R gives when it runs out of memory. The workflow is trying to hold 7531 GB of data in the machine’s memory at once. I would suggest the user divide the data up into batches using a batch macro.
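
For context on why a figure like that is hopeless: a numeric (double) vector in R costs 8 bytes per element, and the error is R failing to get one contiguous block of that size. A back-of-envelope helper (the dimensions below are hypothetical, not the user's):

gb_needed <- function(rows, cols) rows * cols * 8 / 1024^3   # binary gigabytes

gb_needed(1.8e6, 26)   # ~0.35 GB - a big but manageable table
gb_needed(1e6, 1e6)    # ~7450 GB - e.g. a million-by-million intermediate matrix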

pasccout
8 - Asteroid
Spoiler
Not sure of the exact problem, but it seems there is a Select object pointing to the 2nd cluster object but not to the model...
    - based on my tests, this can cause a problem if the fields are not of the same type.

The other thing I would do is ask for a really small sample of a few rows with obfuscated data...

The objects seem to be able to handle up to 1,000,000 records (when I tested it) and 30 columns of fake data...
jamielaird
14 - Magnetar

 

Spoiler

Based on my deep expertise in R (read: a quick Google search), it looks like the workflow is failing due to a lack of system resources (specifically, RAM).

 

See for reference: https://stackoverflow.com/questions/10917532/memory-allocation-error-cannot-allocate-vector-of-size-...

 

Two possible solutions are to:

 

1) Throw more resources at the problem, by running the workflow on a machine with more RAM

2) Make the workflow run more efficiently, so that it can complete with the resources you currently have available

 

The second option is the recommended approach.

 

The screenshot shows that the error is occurring in the 'Append Cluster' tool.

 

I found the following post on Alteryx Community (https://community.alteryx.com/t5/Alteryx-Knowledge-Base/Tool-Mastery-Append-Cluster/ta-p/194965) which states that "Because this tool applies a pre-built model to a data stream, the records being assigned clusters do not need to be fed in to the tool all at once".

 

Therefore, I recommend changing the workflow to contain a batch macro so that records are processed by the Append Cluster tool in smaller batches.

 

 

 

For further guidance I highly recommend the following essential technical guides:

 

[Images: parody O'Reilly covers, including "Googling the Error Message"]