

R is notoriously a memory-heavy language. I don't necessarily think this is a bad thing--R wasn't built to be super performant, it was built for analyzing data! That said, there are times when some implementation patterns are quite...redundant. As an example, I'm going to show you how you can prune a 330 MB glm to 45 KB without losing significant functionality.

Let's trim the R fat

The Model


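Our model is going to be super simple. We're just going to build a glm that predicts whether or not a record from the iris dataset belongs to the setosa species. (Strictly speaking, a logistic regression would need family=binomial; glm defaults to a Gaussian family, which is all we need for a size comparison.) A normal version of this model would look like this: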


fit <- glm(I(Species=="setosa") ~ ., data=iris)
print(paste("Size on 150 rows:", format(object.size(fit), unit="Kb")))
# [1] "Size on 150 rows: 141.8 Kb"


But we're going to intentionally make our model bigger. Much, much bigger. Like big data big. To do that, we'll randomly sample 500,000 rows from iris to make iris.big and then retrain our model.


I realize this isn't the best way to sample data when building a model, but it'll serve our purposes just fine :).


n <- sample(1:150, 500000, replace=TRUE)
iris.big <- iris[n,]
fit <- glm(I(Species=="setosa") ~ ., data=iris.big)
print(paste("Size before pruning:", format(object.size(fit), unit="Mb")))
# [1] "Size before pruning: 330.6 Mb"


If anyone asks, it's pseudo-random


330 MB seems a little large for a simple glm. Let's see what underlying data we can strip away.


Where's the bloat?


First things first. Let's take a look at all of the variables in fit and see what's causing all this mayhem.


library(plyr)  # provides ldply
d <- ldply(names(fit), function(v) {
  # keep the raw byte count too, so the sort below is numeric rather than on "44.8 Mb" strings
  bytes <- as.numeric(object.size(fit[[v]]))
  data.frame(variable=v, size=format(object.size(fit[[v]]), unit="Mb"), bytes=bytes)
})
d[order(d$bytes, decreasing=TRUE), c("variable", "size")]
#             variable    size
#                   qr 46.7 Mb
#                model 44.8 Mb
#                 data 44.8 Mb
#            residuals 31.4 Mb
#        fitted.values 31.4 Mb
#    linear.predictors 31.4 Mb
#              weights 31.4 Mb
#        prior.weights 31.4 Mb
#                    y 29.5 Mb
#              effects  7.6 Mb
#         coefficients    0 Mb
# ...

Ok so there's no way that we need to save all of this data just to make a prediction (after all, it's just coefficients right!?!). Let's see what we can get away with chopping.
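To see why the coefficients really are the essential ingredient, here's a minimal sketch that reproduces predict by hand on the plain 150-row iris data. It assumes the default link-scale prediction and a full-rank fit; fit.small, X, and manual are names made up for this illustration, not part of the original walkthrough.

```r
# Small fit on the built-in iris data (Gaussian family, the glm default).
fit.small <- glm(I(Species == "setosa") ~ ., data = iris)

# Rebuild the design matrix for the predictors only (the response isn't needed).
X <- model.matrix(delete.response(terms(fit.small)), data = iris)

# Linear predictor = design matrix times coefficients.
manual <- drop(X %*% coef(fit.small))

# Compare against what predict() returns for the same rows.
all.equal(unname(manual), unname(predict(fit.small, newdata = iris)))
```

Nothing in that calculation touches the stored training data, residuals, or fitted values, which is a good sign for what follows.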


But if we delete stuff, won't that break things?


At Yhat, we're all about making predictions using R and Python models. So for R models, what we're really concerned with is being able to successfully call the predict function on fit.


I'm going to set aside a validation/test set of predictions that I can use later to make sure that my modified fit is still working correctly.


expected.results <- predict(fit, newdata=iris)
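If you'd rather package that check up, something like the sketch below would do. check_predictions is a hypothetical helper name, and it uses all.equal so tiny floating point differences wouldn't count as a failure; that tolerance is my assumption, not something the pruning strictly requires.

```r
# Hypothetical helper: TRUE if `model` still reproduces the saved predictions.
check_predictions <- function(model, expected, newdata) {
  actual <- predict(model, newdata = newdata)
  isTRUE(all.equal(unname(actual), unname(expected)))
}

# Try it on a small fit so it runs quickly.
fit.small <- glm(I(Species == "setosa") ~ ., data = iris)
expected.small <- predict(fit.small, newdata = iris)
check_predictions(fit.small, expected.small, iris)
```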


Just take a little off the top


I started looking through the heavy variables and found that most of them were some sort of stored training data (data, y) or some sort of diagnostic data for the model (fitted.values, linear.predictors, residuals). A hunch told me that the model didn't actually need any of these to make a prediction.



 Apparently sheep shearing is a big deal.


Let the carnage begin. We'll start by deleting some of the largest variables.


fit$data <- NULL
fit$y <- NULL
fit$linear.predictors <- NULL
fit$weights <- NULL
fit$fitted.values <- NULL
fit$model <- NULL
fit$prior.weights <- NULL
fit$residuals <- NULL
fit$effects <- NULL
format(object.size(fit), unit="Mb")
# [1] "46.7 Mb"
all(predict(fit, newdata=iris)==expected.results)

Ok, 46.7 MB isn't bad and we're still getting valid results from our predict call! But I'm a little greedy. I want to eliminate the 46.7 MB that's still plaguing us from the qr variable.


Unfortunately, I found the pesky qr object COULD NOT be removed from fit...entirely. However, when you remove just the qr$qr variable (I know it's a ridiculous name), things seem to be ok.


fit2 <- fit
fit2$qr <- NULL
try(predict(fit2, newdata=iris)) # removing qr entirely doesn't work: predict fails (try keeps the script running)
fit$qr$qr <- NULL
all(predict(fit, newdata=iris)==expected.results) # just removing qr$qr still works :)
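For the curious, here's a quick sketch of what's actually inside qr, run on a small fit. As far as I can tell, predict only needs the small bookkeeping entries (such as the pivot), not the big decomposition matrix; treat that as my reading rather than a guarantee. fit.small is a name made up for this example.

```r
fit.small <- glm(I(Species == "setosa") ~ ., data = iris)

# The qr component is itself a list; show each piece and its size in bytes.
sapply(unclass(fit.small$qr), function(x) as.numeric(object.size(x)))

# Drop only the decomposition matrix and see which pieces are left.
fit.small$qr$qr <- NULL
names(fit.small$qr)

# predict() still runs with the remaining pieces in place.
head(predict(fit.small, newdata = iris))
```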


The finale


Alright, so I've managed to nuke almost all of the underlying data in my model. How big is it now?


print(paste("Size after pruning:", format(object.size(fit), unit="Kb")))
# [1] "Size after pruning: 45 Kb"

That's right. We've managed to reduce our model by a factor of roughly 7,000 (330.6 MB down to 45 KB). All while not losing what we deem to be "core functionality"!
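If you want to reuse the recipe, the steps above can be rolled into one function. This is only a sketch: prune_glm is a hypothetical name, and it assumes all you'll ever do with the pruned object is call predict on new data (things like summary or residuals will no longer work properly).

```r
# Hypothetical helper: drop the components removed in this post.
prune_glm <- function(model) {
  drop <- c("data", "y", "linear.predictors", "weights", "fitted.values",
            "model", "prior.weights", "residuals", "effects")
  for (name in drop) model[[name]] <- NULL
  model$qr$qr <- NULL  # keep the rest of qr; predict() still needs it
  model
}

# Check on a small fit: smaller object, identical predictions.
fit.small <- glm(I(Species == "setosa") ~ ., data = iris)
expected.small <- predict(fit.small, newdata = iris)
pruned <- prune_glm(fit.small)

as.numeric(object.size(pruned)) < as.numeric(object.size(fit.small))
all(predict(pruned, newdata = iris) == expected.small)
```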


Final thoughts


Thanks to Harlan Harris who originally gave us the idea for this post! If you're interested in reading more about this topic, check out these resources: