Hi All,
I'm evaluating Alteryx Designer for my company and need to build a linear regression model in it. I have both qualitative and quantitative predictor variables. I originally created the model using the statistical software Minitab. I would like to be able to create a first-order main effects model, a first-order interaction model, and a second-order model. I tried the Linear Regression tool in my workflow, but it doesn't seem to have options to add interactions or higher-order terms. The tool also didn't have an option to distinguish my qualitative predictors from my quantitative predictors, and didn't give me a least-squares regression equation.
Any ideas?
Thank you!
-Jennifer
Hi @JenniferO
It is certainly possible to create all of the components of your linear regression project in Alteryx. To create a first-order main effects model, you would simply run your data through the Linear Regression Tool. For the first-order interaction model, you will need to create your interaction terms using a Formula Tool ([Field1]*[Field2]), and then plug those interaction terms into a second Linear Regression Tool. The same concept applies to creating a second-order model. You would first create the necessary variables using the Formula Tool by squaring each of your variables of interest (the pow() function will work nicely), and then push those variables into a third Linear Regression Tool. This way, you have total control over how each of your interaction and second-order variables is created, and over how the corresponding models are generated.
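To make the Formula Tool step concrete, here is a minimal sketch in plain Python of what the derived columns look like. The field names x1 and x2 and the sample values are made up for illustration; in Alteryx the equivalent expressions would be along the lines of [Field1]*[Field2] and pow([Field1], 2).

```python
# Hypothetical records with two numeric predictors, x1 and x2.
rows = [
    {"x1": 1.0, "x2": 2.0},
    {"x1": 3.0, "x2": 4.0},
]


def add_second_order_terms(row):
    """Return a copy of the row with interaction and squared terms added."""
    out = dict(row)
    out["x1_x2"] = row["x1"] * row["x2"]  # interaction term
    out["x1_sq"] = row["x1"] ** 2         # squared term for x1
    out["x2_sq"] = row["x2"] ** 2         # squared term for x2
    return out


expanded = [add_second_order_terms(r) for r in rows]
for r in expanded:
    print(r)
```

The main effects model would use only x1 and x2, the interaction model would add x1_x2, and the second-order model would add the squared columns as well.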
The Linear Regression tool automatically determines variable types based on the field data type, so it is important to make sure your categorical variables are string type (even if they are represented by numbers) and your continuous variables are a numeric data type. You can use a Select Tool to adjust any data types prior to generating a model using the Linear Regression Tool.
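As a rough illustration of the data type adjustment (this is plain Python, not what the Select Tool does internally), the idea is that a category stored as a number becomes a string, while true measurements become numeric. The field names dept_code, years and salary are hypothetical.

```python
# Hypothetical raw records: dept_code is a category stored as a number,
# while years and salary are measurements that arrived as text.
rows = [
    {"dept_code": 10, "years": "3", "salary": "52000.5"},
    {"dept_code": 20, "years": "7", "salary": "61000"},
]

CATEGORICAL = ["dept_code"]      # should be treated as labels
NUMERIC = ["years", "salary"]    # should be treated as quantities


def cast_row(row):
    """Return a copy of the row with categorical fields as str and numeric fields as float."""
    out = dict(row)
    for field in CATEGORICAL:
        out[field] = str(row[field])
    for field in NUMERIC:
        out[field] = float(row[field])
    return out


typed = [cast_row(r) for r in rows]
print(typed[0])
```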
If you are looking for the regression equation, the coefficients of the generated regression equation are included in the "R" output of the model. There is also an R programming language model object output in the "O" anchor. If you would like your coefficients output as data, please check out the Model Coefficients Tool, available in the Predictive District of the Alteryx Analytics Gallery. Simply connect this tool to the "O" output anchor of your Linear Regression Tool, and you will get the coefficients of your equation to use in your data stream.
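Once you have the coefficients as data, writing out the equation is just string assembly. A small sketch in Python, where the function name format_equation and the coefficient values are invented for illustration and are not output from any real model:

```python
def format_equation(response, intercept, coefficients):
    """Build a readable regression equation from an intercept and named coefficients."""
    parts = [f"{intercept:.3f}"]
    for name, value in coefficients.items():
        sign = "+" if value >= 0 else "-"
        parts.append(f"{sign} {abs(value):.3f}*{name}")
    return f"{response} = " + " ".join(parts)


# Made-up coefficients for a model with two predictors.
equation = format_equation("y", 1.5, {"x1": 2.0, "x2": -0.25})
print(equation)  # y = 1.500 + 2.000*x1 - 0.250*x2
```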
If you are instead referring to the method by which the linear regression is fit, by default the tool generates an ordinary least squares (OLS) regression. You can generate a weighted least squares regression by selecting the "Use a weight variable for weighted least squares" option in the customize model panel, or a regularized regression by checking the "Use regularized regression" option. For more information on regularized regression in Alteryx, please see this Community Knowledge Base Article.
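To show the difference between ordinary and weighted least squares, here is a minimal sketch for a single predictor using the closed-form solution. It is plain Python with made-up data, not the tool's implementation, and it does not cover regularized regression.

```python
def fit_line(xs, ys, weights=None):
    """Fit y = intercept + slope * x by least squares.

    With weights=None every record counts equally (OLS); otherwise each
    record's squared error is multiplied by its weight (weighted least squares).
    """
    if weights is None:
        weights = [1.0] * len(xs)
    total = sum(weights)
    x_bar = sum(w * x for w, x in zip(weights, xs)) / total
    y_bar = sum(w * y for w, y in zip(weights, ys)) / total
    sxy = sum(w * (x - x_bar) * (y - y_bar) for w, x, y in zip(weights, xs, ys))
    sxx = sum(w * (x - x_bar) ** 2 for w, x in zip(weights, xs))
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    return intercept, slope


xs = [0.0, 1.0, 2.0, 3.0]
ys = [1.0, 3.0, 5.0, 10.0]  # the last point sits above the line y = 1 + 2x

ols = fit_line(xs, ys)                                 # approx. (0.4, 2.9)
wls = fit_line(xs, ys, weights=[1.0, 1.0, 1.0, 0.0])   # approx. (1.0, 2.0)
print(ols, wls)
```

Giving the last record a weight of zero removes its influence, so the weighted fit recovers the line through the other three points.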
Does this answer all of your questions on using the Linear Regression Tool? Are there any further questions I might be able to help you with? Please let me know!
Hi @SydneyF,
I'm sorry for the delayed reply. Thank you SO MUCH for the detailed response. All of the information you provided was extremely helpful!
Thanks again!
-Jennifer
Hi @SydneyF
I was wondering if you could clarify something for me. If the variables I want to make interaction terms between are categorical and not numerical, the syntax in R is the same. But with your example for creating interaction terms in the Formula Tool, I would not be able to use mathematical operators. Instead, I would create a string out of both variables, yes? Like so: [Department] + "X" + [Location], and then use that as an interaction term in the model?
What about an interaction term between one categorical and numerical variable?
Thank you,
Max Warburg
Hi @maxwarburg,
To find the best way to approach interaction terms with categorical variables, I think it helps to understand how the regression tools handle categorical variables under the hood.
For categorical variables to be included as predictor variables in a linear or logistic regression, they need to be converted to a numeric format. This is accomplished with one-hot encoding, which converts each value in a categorical variable into its own binary variable, where a "1" indicates the record belongs to that category, and a "0" indicates it does not. The regression tool will then exclude one of the resulting variables as a reference group and will assume that any rows that have 0's for all of the variables in a group belong to that excluded value. To help make this example as clear as possible, I have not excluded any encoded variables as a reference group.
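Here is a minimal sketch of one-hot encoding in plain Python, with and without a reference group. The function name one_hot and the department values are made up, and treating the first level in sorted order as the reference is an assumption for illustration; which level the tool actually drops is not stated here.

```python
def one_hot(values, drop_reference=True):
    """One-hot encode a list of category labels.

    If drop_reference is True, the first level in sorted order is left out
    and acts as the reference group (rows that are all zeros).
    """
    levels = sorted(set(values))
    kept = levels[1:] if drop_reference else levels
    return [{f"is_{level}": int(v == level) for level in kept} for v in values]


departments = ["HR", "IT", "HR", "Ops"]

full = one_hot(departments, drop_reference=False)  # one column per level
encoded = one_hot(departments)                     # "HR" is the reference group

for label, row in zip(departments, encoded):
    print(label, row)
```

With the reference group dropped, the two "HR" records have 0's in both remaining columns, which is how the model knows they belong to the excluded value.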
If you'd like to create an interaction term for two categorical variables, then you could create a new column where the values are the concatenated values of the original variables like you've described. The tool will perform the one-hot encoding for each resulting combination (as well as the original variables) under the hood.
However, you will want to be careful of how many unique combinations you are creating with the interaction term, as each unique combination will result in a new (potentially sparse) column. If there are not a lot of records (observations) for each resulting combination, the algorithm will have a difficult time identifying meaningful patterns with the generated variables. Also, the generated combinations might not add much to interpretability.
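A quick way to check for this is to build the concatenated field and count the records per combination before modeling. A sketch in plain Python, where the sample records and the MIN_RECORDS threshold are invented for illustration (pick whatever cutoff makes sense for your data):

```python
from collections import Counter

# Hypothetical records with two categorical fields.
rows = [
    {"Department": "HR", "Location": "NY"},
    {"Department": "HR", "Location": "NY"},
    {"Department": "IT", "Location": "NY"},
    {"Department": "IT", "Location": "LA"},
]

# Concatenate the two fields, like [Department] + "X" + [Location].
for row in rows:
    row["Dept_X_Loc"] = row["Department"] + "X" + row["Location"]

# Count how many records fall into each combination.
counts = Counter(row["Dept_X_Loc"] for row in rows)

MIN_RECORDS = 2  # arbitrary threshold, an assumption for this sketch
sparse = sorted(combo for combo, n in counts.items() if n < MIN_RECORDS)

print(counts)
print("Combinations with too few records:", sparse)
```

Combinations that show up in the sparse list are candidates for grouping together or leaving out of the interaction.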
With that in mind, you can create an interaction term between a categorical variable and a numeric variable by first one-hot encoding the categorical variable, and then multiplying the resulting fields by the numeric variable.
There is a one-hot-encoding macro available on the Alteryx Gallery, and you can use a Multi-Field Formula tool to do the multiplication.
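To sketch what those two steps produce together, here is a plain Python version that one-hot encodes a categorical field and multiplies each indicator by a numeric field. The function name, field names and values are made up. Skipping the first level in sorted order as a reference group is an assumption on my part, made so the new columns do not duplicate the numeric field's own main effect.

```python
def categorical_numeric_interaction(rows, cat_field, num_field):
    """Add one interaction column per non-reference level of cat_field."""
    levels = sorted({r[cat_field] for r in rows})
    result = []
    for r in rows:
        new_row = dict(r)
        for level in levels[1:]:  # first level is treated as the reference
            indicator = 1 if r[cat_field] == level else 0
            new_row[f"{cat_field}_{level}_x_{num_field}"] = indicator * r[num_field]
        result.append(new_row)
    return result


# Hypothetical records: one categorical and one numeric predictor.
rows = [
    {"Department": "HR", "Years": 2.0},
    {"Department": "IT", "Years": 5.0},
    {"Department": "Ops", "Years": 3.0},
]

with_interactions = categorical_numeric_interaction(rows, "Department", "Years")
for r in with_interactions:
    print(r)
```

Each interaction column equals the numeric value for records in that category and 0 everywhere else, so its coefficient reads as the change in slope for that category relative to the reference group.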
Hope this helped!
Sydney,
Thank you so much, I need some time to digest this fully but this completely answers my question and enables me to move forward with my analysis.
The model I am running is for a pay equity analysis, so I'm doing feature engineering to allow looking at interactions without making the data so sparse.
I will incorporate this and let you know the outcome!
Thank you so much Sydney!