Get Inspire insights from former attendees in our AMA discussion thread on Inspire Buzz. ACEs and other community members are on call all week to answer!

Alteryx Designer Desktop Discussions

Find answers, ask questions, and share expertise about Alteryx Designer Desktop and Intelligence Suite.
SOLVED

Elastic Net/Lasso/Ridge Regression

Ashish
8 - Asteroid

Hello,

 

I want to run elastic net regression, which I am trying to run by using "GLMNET" package in "R Tool". I do not have much experiece in R and I am running into errors "S4 class is not subsettable" and I am sure I will get more once I solve this one. 

Has any one sucessfully used Lasso/Ridge/Elasticnet regression in Alteryx? what package did you use and can you share with me how you did it ?

 

Thanks!

Ashish

20 REPLIES 20
Ashish
8 - Asteroid

Hi @bridgetT


Thanks for your help in past.

I am at this thread again as I am facing another problem in the same workwoflow. I am trying to use the weights in glmnet and I am getting error:  R (2) number of elements in weights (0) not equal to the number of rows of x (744).

 

Pasted below is the code that you helped me with earlier.

I have left your comments in from past to help remember the context.

 

I have added the weight variable in the code (highlighted) and I am not sure why its not working. I have tried this code in 'Rstudio' and it works in 'Rstudio' as is.

 

 

# Read the data stream into R
the.data_test <- read.Alteryx("#1", mode="data.frame")
the.data_train <- read.Alteryx("#2", mode="data.frame") 

require(glmnet)
require(stats)
#Instead of starting with the empty dataframes as you did, we'll initialize dataframes of the correct size to hold all the data.
#This data frame will hold the test data. Additionally, the last column will hold the predicted values for the test data.
#It'll play the same role as your newdf.
number.entries <- (NROW(the.data_test) * (NCOL(the.data_test) + 1))
#test.data.out <- as.data.frame(matrix(vector(mode = "numeric", length = number.entries), nrow = NROW(the.data_test), ncol = (NCOL(the.data_test) + 1)))
test.data.out <- as.data.frame(matrix(vector(mode = "numeric"), nrow = 0, ncol = (NCOL(the.data_test) + 1)))
#Create Group variable with unique group values to use to iterate through the loop
Grp <- unique(the.data_train[,34])
#Now that we've created Grp, we can initialize coef1.

#First we initialize the number of entries that will be in the coefficient dataframe.
#You seem to want the entire coefficient matrix for each iteration. 
#(Recall that each call to glmnet will find your coefficients through an iterative process. The last step of this process will yield your "best" coefficients.)
#Since you didn't supply a value for the number of lambda's (ie the number of iterations), the program will choose the default number of 100.

number.coef.entries <- sum(!is.na(Grp)) * 19

###Now we can initialize the dataframe to have all 0's. On each iteration of the for loop, we will update a new row.
##all.coefs <- as.data.frame(matrix(vector(mode = "numeric", length = number.coef.entries), nrow = (sum(!is.na(Grp))), ncol = 18))
all.coefs  <-as.data.frame(matrix(NA,nrow = 0, ncol = 19))

#Initialize the counter to mark where we are on updating test.data.out
current.test.row <- 1
#This one will mark which row of the coef matrix we're on
current.coef.row <- 0

for (i in Grp){
  filt_data <- the.data_train[which(the.data_train$Group == i),]
  if (nrow(filt_data) < 20) next()
  t_pred_matrix <- as.matrix(filt_data[,16:33])
	#t_pred_matrix <- as.matrix(scale(pred_matrix))
  target_var <- as.matrix(filt_data[,15])
  #I took the liberty of eliminating the lambda parameter here.
  #The reference manual for glmnet says that typical usage is to not supply a value for lambda and to let the program compute it.
  #If you do want to supply values, you need to supply a decreasing sequence of lambda values rather than a single value.
	##weights1 <- vector(mode = "numeric")  ## I tried to use vectors in hop that it will resolve the issue
  ##weights1 <- c(weights1,filt_data$weights)
  fit <- glmnet(t_pred_matrix, target_var,family = 'gaussian',weights = filt_data$weights, alpha = 0.5)
  filt_data_test <- the.data_test[which(the.data_test$Group == i),]
  if (nrow(filt_data_test) < 2) next()
  npred_matrix <- as.matrix(filt_data_test[,16:33])
  pred_change <- predict(fit,newx = npred_matrix)
  filt_data_test$pred_chng <- pred_change[,ncol(pred_change)]
  #test.data.out[current.test.row: current.test.row +NROW(filt_data_test),] <- filt_data_test
	test.data.out <- rbind(test.data.out, filt_data_test[nrow(filt_data_test),])
  coef1 <- as.data.frame(t(as.matrix(coef(fit))))
  coef1$name <- i
  all.coefs <- rbind(all.coefs,coef1[nrow(coef1),])
#  coeff.start <- current.coef.row + 1
#  coeff.end <- current.coef.row * 100
#  all.coefs[coeff.start:coeff.end,] <- coef1
#  current.test.row <- current.test.row + NROW(filt_data_test)
#  current.coef.row <- current.coef.row + 100
}
#Filter out any potential cases where the glmnet algorithm terminated before 100 iterations:

write.Alteryx(all.coefs,1)
write.Alteryx(test.data.out,2)

 

Looking forward for help again.

Thanks,

Ashish Singhal

BridgetT
Alteryx Alumni (Retired)

Hi @Ashish,

 

Did you get that same error with both approaches you tried (making the weights1 vector, and just directly trying to take the weights column)? It seems like R thinks that you have no columns in filt_data called weights. But I'm a bit mystified as to why the code works in RStudio, but not in Alteryx. Did you do anything in your workflow before reading the data in, or did you read it directly from the same file you read from in RStudio? The only time I've seen differences between RStudio and the Alteryx R tool is when strings are treated as factors in Alteryx but not RStudio. Does your weights vector have any missing values in it that got coded as the string "NA" or some other string? Then Alteryx would probably read it in as a factor. The weights argument in glmnet() only takes numeric values, but I'd expect to see an error complaining that you tried to give it non-numeric values rather than that you gave it no values at all. But maybe it has some surprising error handling.

 

Best,

Bridget

Bridget Toomey

Research Scientist, Analytic Products

Alteryx
Ashish
8 - Asteroid

Hi @Bridget,

 

Thanks for your thoughts.

 

Yes, I am getting the same error in both the approaches.

 

I am reading the same file in from the database, but there is difference in how I am adding the weight variable. In alteryx I am using the formula tool to add the weight variable, while in RStudio I am using R statements to create the weight variable.

 

R code and Alterxy workflow image pasted below.

cn <- odbcDriverConnect(connection="Driver={SQL Server Native Client 11.0};server=MSS3107;database=AdvancedAnalytics;trusted_connection=yes;")
Input <- sqlQuery(cn, "select * from [AdvancedAnalytics].[dbo].[PRICING_MODEL]")

# Create group variable by concatinating city and company
Input$Group <- paste(Input$City,Input$State, Input$Company, Input$Type, Input$`Prediction Commodity`, sep = "|")
Input$weights <- difftime(Input$Date,"2013-01-01",units = "weeks")+20
the.data_test <- Input[ Input$Date >= as.POSIXct("2015-12-31"),]
the.data_train <- Input[ Input$Date <= as.POSIXct("2015-12-31"),]

2016-07-01_14-19-18.png

 

 

I am doing this because I wanted to compare results of different weighting scheme's, and I have duplicated the modelling piece of the workflow with different formula for weights (hidden in container). while in Rstudio I was just trying to run it once with any weight. There are no missing values or NA in the weights variable.

 

Thanks,

Ashish

 

(P.S. Can you please tell me how to use @ to notify someone when you want to mention them)

 

 

BridgetT
Alteryx Alumni (Retired)

Hi @Ashish,

 

It looks like you're naming the variable Weights (with a capital W) with the Formula tool. R is case sensitive, so trying to extract the column weights when you really want the column Weights won't give you anything. You should either change the name to weights with the Formula tool or edit your R code to extract Weights.

 

To tag people, just type @ and then the first letter of their name. Then a list of relevant people should show up, and you can click on the person you want to tag.

 

Best,

Bridget

Bridget Toomey

Research Scientist, Analytic Products

Alteryx
Ashish
8 - Asteroid

Thanks a lot Bridget,

 

It resolved the issue.

I am ashamed that I spent close to 8 hours (almost whole day yesterday) trying to fix it.

 

Have no words to explain my facepalm right now.

 

Regards,

Ashish

BridgetT
Alteryx Alumni (Retired)

You're welcome, @Ashish! And don't feel bad! Sometimes it just takes another set of eyes to notice these things. I've definitely been in similar situations myself.

Bridget Toomey

Research Scientist, Analytic Products

Alteryx
Ashish
8 - Asteroid

Hi @BridgetT

 

In another experiment I am in a similar situation where the code works fine in R studio but there is a warning in Alteryx, which leads to undesirable results.

 

I am trying to subset my data based on date within R tool, the reason I am doing it in R tool (instead of using alteryx filter tool) is because I want to use function model.matrix for categorical variable before splitting my data and the reason for that is subsetted datasets may or may not have all the factors.

 

Pasted is the code I am trying

 

# Read the data stream into R
Input <- read.Alteryx("#1", mode="data.frame")


season_matrix <- model.matrix(Input$Change ~Input$Season)[,-1]
Input <- cbind(Input,season_matrix)

#Input$Date <- as.Date(Input$Date)
the.data_test <- Input[ Input$Date >= as.POSIXct("2015-12-31"),]
the.data_train <- Input[ Input$Date <= as.POSIXct("2015-12-31"),]

#the.data_test <- Input[ format(Input$Date,"%Y-%m-%d") >= "2015-12-31",]
#the.data_train <- Input[ format(Input$Date,"%Y-%m-%d") <= "2015-12-31",]

#the.data_test <- subset(Input , as.Date(Date) > as.POSIXct("2015-12-31"))
#the.data_train <- subset(Input , as.Date(Date) <= as.POSIXct("2015-12-31"))

But I am getting this warning.

 

R (34) In `[.data.frame`(Input, Input$Date >= as.POSIXct("2015-12-31"), :
R (34) Incompatible methods ("Ops.factor", "Ops.POSIXt") for ">="

Thanks in advance for your help.

 

Regards,

Ashish

 

P.S. using @ symbol seems not to work for me to notify you :(

Ashish
8 - Asteroid

ok, I figured out the fix for the problem. Although I do not fully comprehend why it was not working earlier and why its working now, but somehow it worked.

 

I used as.Date function on both sides of operator to coerce the type.

 

require(glmnet)
require(stats)
# Read the data stream into R
Input <- read.Alteryx("#1", mode="data.frame")
season_matrix <- model.matrix(Input$Change ~Input$Season)[,-1]
Input <- cbind(Input,season_matrix)

#Input$Date <- as.Date(Input$Date)
#the.data_test <- Input[ Input$Date >= as.POSIXct("2015-12-31"),]
#the.data_train <- Input[ Input$Date <= as.POSIXct("2015-12-31"),]

the.data_test <- Input[ as.Date(Input$Date) >= as.Date("2015-12-31"),]
the.data_train <- Input[ as.Date(Input$Date) <= as.Date("2015-12-31"),]

The "Date" variable in the dataset was already of type date so I did not suspect it earlier.

 

Also when I tried formula tool create a new Date variable date format that did not work either.

 

I am relieved that I've found a fix but I lack the understanding what works and why.

 

Thanks,

BridgetT
Alteryx Alumni (Retired)

Hi @Ashish,

 

The problem is that the read.Alteryx function does not natively read in date types in Alteryx as date types in R. I'm attaching a small example to show this phenomenon. (Note that you'll need to have the package DescTools installed in order to run it.) I'm not sure exactly how you got the date field to read in as a date type on RStudio but not the R tool, though. What type of file were you using?

Best,

Bridget

Bridget Toomey

Research Scientist, Analytic Products

Alteryx
Ashish
8 - Asteroid

 

Hi Bridget,

 

Thanks for the example you shared. I also tried this workflow on my file and it converty the date into a V_Wstring

My source file is a SQL Sever table and the Date filed is of the format " yyyy-mm-dd hh:mm:ss", 

I use following connection string to connect and it recognize the filed as date time.

cn <- odbcDriverConnect(connection="Driver={SQL Server Native Client 11.0};server=MSS1234;database=AdvancedAnalytics;trusted_connection=yes;")
Input <- sqlQuery(cn, "select * from [AdvancedAnalytics].[dbo].[PRICING_MODEL]")

 

Regards,

Ashish

 

 

Labels