
Alteryx Designer Desktop Discussions

Find answers, ask questions, and share expertise about Alteryx Designer Desktop and Intelligence Suite.
SOLVED

Elastic Net/Lasso/Ridge Regression

Ashish
8 - Asteroid

Hello,

 

I want to run an elastic net regression, which I am trying to do using the "glmnet" package in the R Tool. I do not have much experience in R, and I am running into the error "S4 class is not subsettable"; I am sure I will get more errors once I solve this one.

Has anyone successfully used Lasso/Ridge/Elastic Net regression in Alteryx? What package did you use, and can you share how you did it?

 

Thanks!

Ashish

20 REPLIES
BridgetT
Alteryx Alumni (Retired)

Hi @Ashish,

 

Is there any chance you can attach your data? In general S4 classes are a bit tricky. If you're still new to R, I'd recommend sticking with S3 classes as much as you can. The Google R style guide even says to "avoid S4 objects and methods when possible; never mix S3 and S4." But if you absolutely must use data from class S4, this page has several good resources to get you started.
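
To make the difference concrete, here is a tiny standalone example (a made-up class, nothing to do with your data) of why subsetting an S4 object the S3 way produces exactly the error you're seeing:

 

# S3 objects are usually lists underneath, so $ and [[ work on them
s3.obj <- list(x = 1, y = 2)
class(s3.obj) <- "myS3class"
s3.obj$x       # returns 1

# S4 objects keep their data in slots, accessed with @ or slot()
setClass("Point", representation(x = "numeric", y = "numeric"))
s4.obj <- new("Point", x = 1, y = 2)
s4.obj@x       # returns 1
# s4.obj[[1]]  # error: "this S4 class is not subsettable"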

 

Anyway, I'm pretty sure that you can only use glmnet with S3 classes, so you're going to need to look elsewhere if you want to perform elastic net regression on your data. You could try this package, which does have an elastic.net function. The pdf I linked indicates that the function produces S4 models, so I'd assume that it also takes in S4 data. However, I couldn't find an indication either way when I skimmed the pdf. It's worth a try at least. Another option would be to use a different model entirely on your data. Since you want to use elastic net regression, I'm assuming that you're looking for a model that will perform variable selection, ideally to the point where a (relatively) small number of features are selected. I just came across this sparse analogue of Support Vector Machines called Relevance Vector Machines, which conveniently also has an R implementation for S4 classes.

 

Anyway, I hope at least one of those solutions works for you! Let me know if you have any more questions.

 

Best,

Bridget

 

Edit: Actually, it turns out that Elastic Net reduces to Linear SVM. Also, the package I linked you before with the S4 implementation for Relevance Vector Machines also includes Support Vector Machines! However, implementing the correct parameters for this reduction may be a bit tricky, so you might want to try the other approaches I suggested first.

Bridget Toomey

Research Scientist, Analytic Products

Alteryx
Ashish
8 - Asteroid

Hi Bridget,

 

Thanks a lot for the resources and insights. It turned out that the cause of the error was mainly my inexperience with Alteryx and R in combination.

 

The error was happening at "write.Alteryx" because I was trying to output the model coefficients as-is (without converting them to a matrix). I am not sure if that's the correct explanation, but the issue got resolved when I converted the coefficients into a matrix.

 

I found that I have to convert data frames to a matrix to write them out; in the cases where I wanted a data frame back in the output, I convert them to a matrix first and then back to a data frame before writing out. I am sure I am missing something here and will appreciate any insights.
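
For reference, here is a stripped-down, standalone version of the conversion that made the write-out work for me (made-up data, not my actual workflow):

 

require(glmnet)
x <- matrix(rnorm(100 * 5), ncol = 5)
y <- rnorm(100)
fit <- glmnet(x, y, family = "gaussian", alpha = 0.5, lambda = 0.001)
class(coef(fit))    # "dgCMatrix" -- a sparse S4 matrix, which write.Alteryx cannot stream directly
coef_out <- as.data.frame(as.matrix(coef(fit)))    # ordinary matrix first, then data frame
# write.Alteryx(coef_out, 1)    # works once it is a plain data frame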

 

# Read the data stream into R
the.data_test <- read.Alteryx("#1", mode="data.frame")
the.data_train <- read.Alteryx("#2", mode="data.frame") 

require(glmnet)

#Empty data frame to hold predicted data
newdf <-as.data.frame(matrix(NA,nrow = 0, ncol = 32))
#Empty data frame to hold coefficients 
coef_fit <-as.data.frame(matrix(NA,nrow = 0, ncol = 18))

#Create Group variable with unique group values to use to iterate through the loop
Grp <- unique(the.data_train[,31])

#Using For loop to create separate model for each group
for (i in Grp){
  filt_data <- the.data_train[which(the.data_train$Group == i),]
  pred_matrix <- as.matrix(filt_data[,15:30])
  target_var <- as.matrix(filt_data[,14])
  fit <- glmnet(pred_matrix, target_var, family = "gaussian", alpha = 0.5, lambda = 0.001)
  filt_data_test <- the.data_test[which(the.data_test$Group == i),]
  npred_matrix <- as.matrix(filt_data_test[,15:30])
  pred_change <- predict(fit,newx = npred_matrix)
  filt_data_test$pred_chng <- pred_change
  newdf <- rbind(newdf,as.data.frame(filt_data_test))
  coef1 <- as.data.frame(t(as.matrix(coef(fit))))
  coef1$name <- i
  coef_fit <- rbind(coef_fit,coef1)
  rm(coef1)
}

write.Alteryx(coef_fit,1)

write.Alteryx(as.data.frame(as.matrix(newdf)),2)

 


BridgetT
Alteryx Alumni (Retired)

Hi @Ashish,

 

Just to clarify your problem: your code runs correctly now as long as you first convert newdf into a matrix and then into a dataframe? So you didn't need to perform this conversion for coef_fit? And the only error you were getting before was "S4 class is not subsettable"?

 

I can't say for sure what is going on without being able to see your data or a reproducible example with the same error. However, you should check out this page for some information about dataframes. Can you include your data so I can recreate the exact error you're getting? If your data is quite large, try taking a sample of it. If you get the same issue, that sample is probably enough to send me.

 

My hunch is that you're getting an issue because of the way you're "record-keeping" the results from each iteration of the for loop. I think the way you initialize an empty dataframe of NA's might be causing some unexpected behavior. Generally, you should try to avoid for loops in R as much as possible, because they can be quite slow. But when you do use them, you should vectorize your code as much as possible. (Here is a good resource about for loops and vectorization.) The basic idea behind vectorization is that we want to initialize a "skeleton" for whatever matrix, dataframe, list, etc. we are creating in the for loop. Even if vectorization doesn't solve your problem, it should speed up your code, especially if your data set is large. Repeatedly changing the size of a matrix or dataframe on each iteration using rbind can slow things down considerably.
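
Here's a toy example (made-up numbers, nothing to do with your workflow) of the difference between growing a data frame with rbind on every pass and filling in a pre-allocated skeleton:

 

n <- 5000

# Growing with rbind: every iteration copies everything accumulated so far
slow <- data.frame()
for (i in 1:n) {
  slow <- rbind(slow, data.frame(id = i, value = i^2))
}

# Pre-allocating a skeleton of the final size and filling it in place
fast <- data.frame(id = numeric(n), value = numeric(n))
for (i in 1:n) {
  fast[i, ] <- c(i, i^2)
}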

 

I'm currently re-writing your code with vectorization. Can you try it and let me know if it fixes your problem?

 

Best,

Bridget

Bridget Toomey

Research Scientist, Analytic Products

Alteryx
BridgetT
Alteryx Alumni (Retired)

Ok, here's the code. Let me know if it solves your error.

 

# Read the data stream into R
the.data_test <- read.Alteryx("#1", mode="data.frame")
the.data_train <- read.Alteryx("#2", mode="data.frame") 

require(glmnet)

#Instead of starting with the empty dataframes as you did, we'll initialize dataframes of the correct size to hold all the data.
#This data frame will hold the test data. Additionally, the last column will hold the predicted values for the test data.
#It'll play the same role as your newdf.
number.entries <- (NROW(the.data_test) * (NCOL(the.data_test) + 1))
test.data.out <- as.data.frame(matrix(vector(mode = "numeric", length = number.entries), nrow = NROW(the.data_test), ncol = (NCOL(the.data_test) + 1)))

#Create Group variable with unique group values to use to iterate through the loop
Grp <- unique(the.data_train[,31])

#Now that we've created Grp, we can initialize coef1.
#You initialized coef_fit to have 18 columns, but I'm fairly certain it should only have 17 columns.
#You are using columns 15-30 in your prediction matrix + 1 intercept coefficient = 16 + 1 = 17 columns.
#But please correct this if I'm wrong.
#First we initialize the number of entries that will be in the coefficient dataframe.
#You seem to want the entire coefficient matrix for each iteration. 
#(Recall that each call to glmnet will find your coefficients through an iterative process. The last step of this process will yield your "best" coefficients.)
#Since you didn't supply a value for the number of lambdas (i.e. the length of the lambda sequence), the program will choose the default of 100.
number.coef.entries <- length(Grp) * 17 * 100
#Now we can initialize the dataframe to have all 0's. On each iteration of the for loop, we will update a new row.
all.coefs <- as.data.frame(matrix(vector(mode = "numeric", length = number.coef.entries), nrow = (length(Grp) * 100), ncol = 17))
#Initialize the counter to mark where we are on updating test.data.out
current.test.row <- 1
#This one will mark which row of the coef matrix we're on
current.coef.row <- 0

for (i in Grp){
  filt_data <- the.data_train[which(the.data_train$Group == i),]
  pred_matrix <- as.matrix(filt_data[,15:30])
  target_var <- as.matrix(filt_data[,14])
  #I took the liberty of eliminating the lambda parameter here.
  #The reference manual for glmnet says that typical usage is to not supply a value for lambda and to let the program compute it.
  #If you do want to supply values, you need to supply a decreasing sequence of lambda values rather than a single value.
  fit <- glmnet(pred_matrix, target_var, family = "gaussian", alpha = 0.5)
  filt_data_test <- the.data_test[which(the.data_test$Group == i),]
  npred_matrix <- as.matrix(filt_data_test[,15:30])
  pred_change <- predict(fit,newx = npred_matrix)
  filt_data_test$pred_chng <- pred_change
  test.data.out[current.test.row:(current.test.row + NROW(filt_data_test) - 1),] <- filt_data_test
  coef1 <- as.data.frame(t(as.matrix(coef(fit))))
  coef1$name <- i
  coeff.start <- current.coef.row + 1
  coeff.end <- current.coef.row + 100
  all.coefs[coeff.start:coeff.end,] <- coef1
}

write.Alteryx(all.coefs,1)

write.Alteryx(test.data.out,2)
Bridget Toomey

Research Scientist, Analytic Products

Alteryx
BridgetT
Alteryx Alumni (Retired)

Actually, the code I initially posted has a bug. Try this instead:

 

# Read the data stream into R
the.data_test <- read.Alteryx("#1", mode="data.frame")
the.data_train <- read.Alteryx("#2", mode="data.frame") 

require(glmnet)

#Instead of starting with the empty dataframes as you did, we'll initialize dataframes of the correct size to hold all the data.
#This data frame will hold the test data. Additionally, the last column will hold the predicted values for the test data.
#It'll play the same role as your newdf.
number.entries <- (NROW(the.data_test) * (NCOL(the.data_test) + 1))
test.data.out <- as.data.frame(matrix(vector(mode = "numeric", length = number.entries), nrow = NROW(the.data_test), ncol = (NCOL(the.data_test) + 1)))

#Create Group variable with unique group values to use to iterate through the loop
Grp <- unique(the.data_train[,31])

#Now that we've created Grp, we can initialize coef1.
#You initialized coef_fit to have 18 columns, but I'm fairly certain it should only have 17 columns.
#You are using columns 15-30 in your prediction matrix + 1 intercept coefficient = 16 + 1 = 17 columns.
#But please correct this if I'm wrong.
#First we initialize the number of entries that will be in the coefficient dataframe.
#You seem to want the entire coefficient matrix for each iteration. 
#(Recall that each call to glmnet will find your coefficients through an iterative process. The last step of this process will yield your "best" coefficients.)
#Since you didn't supply a value for the number of lambdas (i.e. the length of the lambda sequence), the program will choose the default of 100.
number.coef.entries <- length(Grp) * 17 * 100
#Now we can initialize the dataframe to have all 0's. On each iteration of the for loop, we will update a new row.
all.coefs <- as.data.frame(matrix(vector(mode = "numeric", length = number.coef.entries), nrow = (length(Grp) * 100), ncol = 17))
#Initialize the counter to mark where we are on updating test.data.out
current.test.row <- 1
#This one will mark which row of the coef matrix we're on
current.coef.row <- 0

for (i in Grp){
  filt_data <- the.data_train[which(the.data_train$Group == i),]
  pred_matrix <- as.matrix(filt_data[,15:30])
  target_var <- as.matrix(filt_data[,14])
  #I took the liberty of eliminating the lambda parameter here.
  #The reference manual for glmnet says that typical usage is to not supply a value for lambda and to let the program compute it.
  #If you do want to supply values, you need to supply a decreasing sequence of lambda values rather than a single value.
  fit <- glmnet(pred_matrix, target_var, family = "gaussian", alpha = 0.5)
  filt_data_test <- the.data_test[which(the.data_test$Group == i),]
  npred_matrix <- as.matrix(filt_data_test[,15:30])
  pred_change <- predict(fit,newx = npred_matrix)
  filt_data_test$pred_chng <- pred_change
  test.data.out[current.test.row:(current.test.row + NROW(filt_data_test) - 1),] <- filt_data_test
  coef1 <- as.data.frame(t(as.matrix(coef(fit))))
  coef1$name <- i
  coeff.start <- current.coef.row + 1
  coeff.end <- current.coef.row + 100
  all.coefs[coeff.start:coeff.end,] <- coef1
  current.test.row <- current.test.row + NROW(filt_data_test)
  current.coef.row <- current.coef.row + 100
}
#Any leftover all-zero rows (from cases where glmnet stopped before 100 lambda values) can be filtered out downstream; see the note below.

write.Alteryx(all.coefs,1)

write.Alteryx(test.data.out,2)

 

Note that you may have some extra rows of all 0's in your dataframe of coefficients. You can remove these within Alteryx after writing it out by following these steps:

1. Sum the absolute values of all the columns using a Formula tool

2. Use a Filter tool to filter out all of the rows with values of 0 in the column you just created.
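
If you would rather do it inside the R Tool before writing out, the same idea is a one-liner (this assumes all.coefs is entirely numeric, as it is in the code above):

 

# Keep only rows where at least one coefficient is non-zero; place this right before write.Alteryx(all.coefs, 1)
all.coefs <- all.coefs[rowSums(abs(all.coefs)) > 0, ]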

Bridget Toomey

Research Scientist, Analytic Products

Alteryx
Ashish
8 - Asteroid

Thanks a lot, Bridget, for providing a detailed explanation and for enduring my buggy code to provide a working solution.

A few things I wanted to mention:

  • After resolving the "S4 class is not subsettable" error, the error I was facing was roughly "could not write YXDB streaming", which got resolved when I converted from data frame to matrix and back to data frame again. Now from your code I see how to do it correctly.
  • I initialized the coefficient data frame (coef_fit) with 18 columns because I wanted to store the group name in it to be able to identify each group later; therefore I added a new column, "name", at the end:
    	coef1 <- as.data.frame(t(as.matrix(coef(fit))))
    	coef1$name <- i

 

  • Also, although I started without specifying any value for lambda, I set it to 0.001 because storing the whole coefficient matrix inside the loop was maxing out the initialized matrix size and Alteryx's built-in variable-renaming logic; that value of lambda (with my limited knowledge I did not know its impact) happened to give me exactly one row in coef_fit. Now I realize that although this made the code work, it compromised the regression.
  • I will now try to store only the best coefficients, which will ultimately be used for each Grp value (see the rough sketch after this list).
  • Regarding not using loops in R, it's a very good point. I initially did not want to use one, so can you suggest a better way to do this type of exercise? I have previously used a Group variable in the "lmList" function of "nlme", but I am not sure if something similar is available in glmnet:
    # Read the data stream into R
    the.data <- read.Alteryx("#1", mode="data.frame")
    require(nlme)
    Fit <- lmList(%Question.TargetVar% ~ %Question.XVars% | Group, data=the.data)
  • In the past I have also tried batch macros in Alteryx, which take a lot more time because the R Tool opens and closes for each iteration (my understanding, which could be wrong), so I wanted to run the iterations within the R Tool.
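
Here is a rough, untested sketch of what I plan to try for each group, based on the code above: fit without a fixed lambda and keep only one row of coefficients, for example at the final lambda in the path (last.lambda is just a name I made up for the sketch):

 

filt_data <- the.data_train[which(the.data_train$Group == i),]
fit <- glmnet(as.matrix(filt_data[,15:30]), as.matrix(filt_data[,14]),
              family = "gaussian", alpha = 0.5)
last.lambda <- fit$lambda[length(fit$lambda)]            # smallest lambda glmnet tried
coef1 <- as.data.frame(t(as.matrix(coef(fit, s = last.lambda))))
coef1$name <- i                                          # one coefficient row per group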

 

Now that I have better code, I will play around with it to learn and see how it turns out.

 

Although you have provided me with plenty of references on best practices in R, it would also be helpful if you could direct me to a good resource for understanding regularized regression, and to best practices in Alteryx when working with the R Tool.

 

Thanks again for your help.

Ashish Singhal

 

BridgetT
Alteryx Alumni (Retired)

Hi @Ashish,

 

I'll just respond to your points in order:

 

  • After resolving the "S4 class is not subsettable" error, the error I was facing was roughly "could not write YXDB streaming", which got resolved when I converted from data frame to matrix and back to data frame again. Now from your code I see how to do it correctly.
    • I'm glad it's working for you now! :)
  • I initialized the coefficient data frame (coef_fit) with 18 columns because I wanted to store the group name in it to be able to identify each group later; therefore I added a new column, "name", at the end:
    	coef1 <- as.data.frame(t(as.matrix(coef(fit))))
    	coef1$name <- i
    •  That makes sense! Sorry, I didn't realize that you were trying to do that before.
  • Also, although I started without specifying any value for lambda, I set it to 0.001 because storing the whole coefficient matrix inside the loop was maxing out the initialized matrix size and Alteryx's built-in variable-renaming logic; that value of lambda (with my limited knowledge I did not know its impact) happened to give me exactly one row in coef_fit. Now I realize that although this made the code work, it compromised the regression.
    • Ok, now I understand where you're coming from. Instead of the solution I posted (it generates up to 100 rows of coefficients per iteration), you can just save the last row of the coefficient matrix during each iteration (see the sketch after this list). I would strongly advise this approach instead of setting lambda equal to 0.001, since the algorithm behind glmnet can be quite sensitive to small changes in lambda. Unless you really know what you're doing, the algorithm will probably do a better job of picking a set of lambda values.
  • I will now try to store only the best coefficients, which will ultimately be used for each Grp value.
    • Ok, great! Yeah, just store the last row of the coefficient matrix on each iteration like I suggested above.
  • Regarding not using loops in R, it's a very good point. I initially did not want to use one, so can you suggest a better way to do this type of exercise? I have previously used a Group variable in the "lmList" function of "nlme", but I am not sure if something similar is available in glmnet:
    # Read the data stream into R
    the.data <- read.Alteryx("#1", mode="data.frame")
    require(nlme)
    Fit <- lmList(%Question.TargetVar% ~ %Question.XVars% | Group, data=the.data)
    • Based on skimming the glmnet guide, it seems that glmnet does not have an option to include a group variable. I was initially intrigued by something called the group lasso, but it seems that the group here refers to groups of coefficients rather than the groups you were looking for.
  • In the past I have also tried batch macros in Alteryx, which take a lot more time because the R Tool opens and closes for each iteration (my understanding, which could be wrong), so I wanted to run the iterations within the R Tool.
    • Yes, you're completely right about this. Even though for loops in R aren't ideal, they're still faster than opening and closing the Alteryx R tool for each iteration of the batch macro.
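
If you do want to drop the explicit for loop, one pattern worth trying (a sketch only, reusing the column positions from the code above, with fit.one.group as a made-up helper name) is to split the work by group with lapply and let cv.glmnet, which ships with the glmnet package, pick lambda by cross-validation:

 

require(glmnet)

fit.one.group <- function(g) {
  train <- the.data_train[which(the.data_train$Group == g),]
  cvfit <- cv.glmnet(as.matrix(train[,15:30]), as.matrix(train[,14]),
                     family = "gaussian", alpha = 0.5)
  best <- coef(cvfit, s = "lambda.min")          # one column of coefficients at the CV-chosen lambda
  out <- as.data.frame(t(as.matrix(best)))
  out$name <- g
  out
}

coef_fit <- do.call(rbind, lapply(unique(the.data_train$Group), fit.one.group))
write.Alteryx(coef_fit, 1)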

As far as references go, this is a great one if you have a strong math background. If you haven't taken Linear Algebra or other courses beyond Calculus, this book is probably a better option for you. (The second book also has R examples, which might be especially useful for you.) The authors of the Elastic Net algorithm actually wrote both books with some other collaborators, so I think either one would be a great choice if you want to know more about the theory behind l1/l2 regularization.

Edit: The second book doesn't directly mention Elastic Net, but it does explain Lasso and Ridge Regression. So once you understand those methods, you can just think of Elastic Net as a compromise between them.
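
For quick reference, the objective glmnet minimizes for family = "gaussian" (in the notation of its own documentation) is

 

(1/(2N)) * sum_i (y_i - beta0 - x_i' beta)^2 + lambda * [ (1 - alpha)/2 * ||beta||_2^2 + alpha * ||beta||_1 ]

 

so alpha = 1 gives the Lasso, alpha = 0 gives Ridge regression, and an in-between alpha (like the 0.5 in your code) gives the Elastic Net compromise between the two penalties.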

 

I don't know that many great references about best practices for the R tool, but here's one that covers some basics.  Here and here are some more discussions on the Community about the R Tool. Finally, here is a recording of a webinar about using the R Tool, but I haven't listened to it personally yet. I may write an EngineWorks post in the future about the R Tool, since other customers besides you have also expressed interest. Also, I'd highly recommend attending the Inspire conference in San Diego if you can make it. I will be working in the Solutions Center there, along with the rest of the Content Engineering/Advanced Analytics teams. Many of us have considerable experience integrating R and Alteryx, as well as theoretical knowledge about predictive models. We'd be happy to help you solve any of your pressing predictive problems.

Edit: There's actually a special section for Advanced R users. Here's the description: "This session is directed towards Alteryx users who are already proficient in R and would like to extend its analytical capabilities further by writing custom code and macros. We will cover a wide range of topics, including: How to use Alteryx to perform out of memory scaling in R; how to manage R code in workflows in an IDE like RStudio; and how to create custom interactive visualizations using R." You're doing pretty well for someone who's just recently started R; I'd say that you'll know enough to attend that session in two months.

 

Best,

Bridget

Bridget Toomey

Research Scientist, Analytic Products

Alteryx
Ashish
8 - Asteroid

Thanks a lot Bridget for the explanation and resources.

 

About Alteryx Inspire: I learned about it from the Alteryx hangout on YouTube around last week. Unfortunately I will not be able to join it in San Jose; I am located in Ohio and will try to attend in the future if it is within feasible reach.

 

Regards,

Ashish

BridgetT
Alteryx Alumni (Retired)

Hi @Ashish,

 

You're very welcome! I'm glad I could help. And Inspire is actually in San Diego, but I totally understand how expensive it can be to fly to California if you're not on the West Coast! Hopefully next year's Inspire will be closer to Ohio so you can attend.

 

Cheers,

Bridget

Bridget Toomey

Research Scientist, Analytic Products

Alteryx