I'm struggling to figure out how to create an efficient and intelligent module that will be easy to manage as my data becomes more complex. I've created an example here to show you what I mean. There must be a clever solution that will allow my workflow to scale up. I'll try to explain it as best I can...
And here is my question: This solution seems to work well, and although my example is much simpler than my real data, my solution works to automatically calculate the 'missing' rows required for my dashboard. However, what if I want to cut my data by 3, 4 or 5 variables? What if I want to cut the data by gender (x 3), income bands (x 10), age bands (x 10), country (x 50), etc?
I need to calculate a lot of rows here (which is fine, as my dashboard software is not smart enough to aggregate on the fly, but it does retrieve rows from a long table very quickly). The problem is that in its current layout, my module will have to grow massively for every extra variable. There must be a super smart way to design this process without having to create 10s or 100s of separate Summarize tools, all with slightly different settings. I wonder if there's a way to build a workflow or macro in which you can specify how to cut the data, and it does the rest. Does anyone have any ideas?
Solved! Go to Solution.
That comes from the addmargins() function, which can utilize any functions for the margin; the default is Sum(...) and it just uses the function name. So one thing we could do is create another function that is essential a rename of Sum using something you're pretty sure will never appear in the data...
# add marginal sums...
any_value <- sum xt2 <- addmargins(xt1,FUN=any_value)
Unfortunately I couldn't get the nice parenthetical "(any)" there, which I sort of liked... but we can make that change before leavign R if desired...
# format table, and convert to data frame... df2 <- as.data.frame(ftable(xt2)) i <- sapply(df2, is.factor) df2[i] <- lapply(df2[i], as.character) df2[df2=='any_value'] <- "(any)"
This is all in the newly attached macro.
Thank you, that's excellent.
I've come up against a problem. The operational data I'm trying to aggregate is so big it's crashing - I've attached a sample of test data (not confidential) to give you an idea.
I wonder if there's a way to reduce the number of 0s in the tools output, because it (sensibly, as per my brief) sums all possible combinations. However in real life the actual number of combinations are less than the theoretical number. I wonder if this efficiency would be enough to prevent my system from crashing. I even got some kind of error from R warning that the 'vector space' or something like that exceeded 50GB so the tool exited!
In other words, for an input like this...
...you get an output like this which contains lots of 0s which are potential combinations that don't actually occur anywhere...
Or is there another way to do the aggregation on my full dataset (attached) that won't crash R / Alteryx?
Thanks!
The attached adds a checkbox to the macro allowing you to remove zero-frequency rows before handing things back to Alteryx.
I was able to select more rows with this, and it returned over a million possibilities back to Alteryx. If the problem is deeper in R, though (e.g. in calls to addmargins or xtabs), then unfortunately I won't have an answer for that.
Thanks. I wonder how much RAM you've got? I just upgraded from 8GB to 16 but am still hitting limits.
For example, in the sample dataset I posted yesterday, I can select all the checkboxes from the top down until this point:
I can't tick the last three (the bottom one is the count statistic) because it takes ages and eventually crashes. Can you get these to work?
Many thanks
I'll have a look in a bit - I'm using a crummy dual core i5 with 8GB RAM, so nothing special.
Thanks. I wonder if there are any tricks with field names or data types that could make the process less memory demanding. I can see that with each new field I include in the input, the number of calculations grows by a large amount! If not exponentially then certainly an order or magnitude. I have a quad core i5 and more RAM than you so it's quite puzzling. I'll maybe try this on my i7 machine at home....
I went into R-Studio at home also, (i7, 16GB) to putz around a bit and basically it fails on xtabs which isn't too surprising: it's expanding to every possible combination of every value entered in the grid; so the grid being 11 x 185,000 and then turn that into combinations... it's basically the "subset-sum" problem which is an exponential-time problem; we're trying to do it in exponential space... so, even if we found a 64GB machine to throw at it, there might still be problems in the general case, e.g. if you add more columns to the mix.
And if we think about it, that's more rows than are really useful anyway: nobody is ever going to look at them all. So the next step might be to redefine the problem: are these being fed into a calculation, or are they looking for particular cases... or even would it be OK to generate any specific case... etc...?
We could hard-code a solution with loops, but I'm concerned that would simply shift it from space-wise inefficiencies to time-wise inefficiencies, as above... (even my earlier hard-coded example with some loops failed on the expand.grid function in R before it even got to the loops; we'd have to convert expand.grid to a custom loop also... I don't think that's worth the effort, based on the above discussion.)
I think there's a solution to make it more effecient, I just need a little bit of help to get there.
Not all the variables need to be aggregated against each other. Is it possible to nominate some fields as the 'grouping' field, and other variables as the aggregation fields. For example, can we treat fields to the left of the yellow line as group by fields, and can the macro process be run once for each unique category in the grouping field?
Records to the right are the ones that need to have the All added, which means their total number of combinations will be quite big. Still, they ought to be in the 100s of thousands, not billions.
Is there an easy way to feed chunks of data into a macro based on a grouping field? In effect, running the macro X number of times and collecting all the results into one big union.
Sorry I have explained this very badly, I should be able to pull together a better example tomorrow.
Makes sense... R is functional if nothing else, so I should be able to encapsulate what's there into a function that is applied over the relevant groups. Will have a look this evening.
Here's what I'm thinking - if I filter off one course ( [Course] = "Lower-SchoolA - Course 1" ) then I can run that through your excellent tool, configured with only the following fields in the checkbox:
And summing on count as usual. There are 7 * 3 * 6 * 9 * 3 * 9 = 30,618 possible combinations, of which 14,876 yield non-zero results.
How could the macro be adapted to run once for each unique value of the 'Course' field, and then union the results into one massive list? I think that could work! Thanks