Sorry, I sent that by mistake; it was meant for someone else. Please accept my apology.
I tried to work through your solution because I have the same problem, but related to linear regression. I noticed you used the Generate Rows tool. As I understand it, it will create a new field once the number of rows reaches 1,000,000, but what does this mean? What is it useful for? How does it solve the memory limitation problem? Please explain more.
Note: after running your workflow for a while (around 7 minutes), my machine froze and stopped responding!
For the coworker's particular problem, the data size seems to be too big to handle. Connecting from the Select tool and removing some fields may solve the problem.
For my R error, the problem was a field type mismatch. Connecting from the Select tool also solved it.
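In case a concrete (non-Alteryx) illustration helps, here is a rough pandas sketch of those same two fixes: keeping only the fields the model needs and repairing a field type mismatch. The file name and field names are made up for illustration only.

```python
import pandas as pd

# Hypothetical input; the file and columns below are invented for the example.
df = pd.read_csv("customers.csv")

# Keep only the fields the clustering actually needs
# (mirrors a Select tool that deselects unused fields).
needed = ["customer_id", "age", "income", "visits"]
df = df[needed]

# Fix a field type mismatch: a numeric column that arrived as text
# (e.g. "42,000") would otherwise be passed to the model as a string.
df["income"] = pd.to_numeric(df["income"].str.replace(",", ""), errors="coerce")

print(df.dtypes)
```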
I chose to get the necessary data directly from the K-Centroids Cluster Analysis tool instead of using an Append Cluster tool. Assuming the K-Centroids Cluster Analysis tool works, my solution should hopefully be a more optimised way of obtaining the data.
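For anyone curious what that idea looks like outside Alteryx, here is a minimal scikit-learn sketch (purely an assumed stand-in for the K-Centroids / Append Cluster pair): the fitted model already carries the cluster assignment for every row it was trained on, so a separate assignment pass is only needed for data the model has not seen.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
X = rng.normal(size=(10_000, 4))   # stand-in for the training data

model = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)

# The assignments for the training rows come straight from the fit;
# no second "append cluster" step is required for these rows.
labels = model.labels_
print(labels[:10])
```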
Big hint in the help page for this one!
First, I read the help on the Append Cluster tool, and there was a big hint here!
Knowing that (1) the actual cluster analysis runs, and (2) I don't need to send the full original dataset into the Append Cluster tool all at once, maybe the scoring can be done in batches (sketched below). I generated 5M fake records for testing.
My solution is pretty similar to the given solution:
Batch macro:
Outer workflow:
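To make the batching idea concrete outside of Alteryx, here is a rough Python/scikit-learn sketch of the same pattern the batch macro follows (the row counts, batch size, and column count are all assumed): fit the clusters once, then append the cluster label to the big dataset one chunk at a time, so the scoring step never has to hold all 5M rows in memory at once.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(42)

# Fit the clusters once, e.g. on a sample
# (plays the role of the K-Centroids Cluster Analysis tool).
sample = rng.normal(size=(100_000, 5))
model = KMeans(n_clusters=4, n_init=10, random_state=0).fit(sample)

# "Append cluster" to the full dataset in batches
# (plays the role of the batch macro around the Append Cluster tool).
n_rows, batch_size = 5_000_000, 500_000
labels = np.empty(n_rows, dtype=np.int32)
for start in range(0, n_rows, batch_size):
    stop = min(start + batch_size, n_rows)
    batch = rng.normal(size=(stop - start, 5))   # stand-in for reading one batch
    labels[start:stop] = model.predict(batch)

print(np.bincount(labels))
```

The batch size is a tuning knob: smaller batches keep peak memory lower at the cost of a little more overhead per pass.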
There may also be some opportunity to use more efficient data types. Possibly that is already happening in the Select tool in the original workflow, in which case we should connect the Select output to the Append Cluster tool rather than the raw input.
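As a quick illustration of how much data types alone can matter (the column names and sizes are just an example), downcasting to the smallest type that still fits the values can shrink the in-memory footprint considerably:

```python
import numpy as np
import pandas as pd

n = 5_000_000
df = pd.DataFrame({
    "score": np.random.rand(n),              # float64 by default
    "count": np.random.randint(0, 100, n),   # 64-bit ints on most platforms
})
print(df.memory_usage(deep=True).sum() / 1e6, "MB")   # roughly 80 MB

# Downcast to smaller types where the value range allows it.
df["score"] = df["score"].astype("float32")
df["count"] = df["count"].astype("int8")
print(df.memory_usage(deep=True).sum() / 1e6, "MB")   # roughly 25 MB
```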
"The "cannot allocate vector" error is what R gives when it runs out of memory; as you can see it's trying to allocate quite a large 5+ GB chunk of RAM. This is not unexpected for large datasets, as R can be somewhat memory inefficient. Potential solutions:
Many possible issues came to mind, but I could not replicate the same error message with any test dataset I had (probably due to their smaller size).
Anyway, here are my workflow and thoughts: