
Alteryx Designer Desktop Discussions

Find answers, ask questions, and share expertise about Alteryx Designer Desktop and Intelligence Suite.
SOLVED

Factor analysis

NM
7 - Meteor

What is the best tool to create factors from predictor variables?

For example, I have 25 predictor variables and would like to create 5 factors out of them. Currently, I am using the Principal Components tool, which gives me those 5 factors but does not tell me which predictor variables each factor is composed of.

 

Any reading material on how the calculation works to create principal components would be helpful.

 

Thanks!

5 REPLIES
RodL
Alteryx Alumni (Retired)

I assume you are differentiating between Factor Analysis and PCA?

Alteryx doesn't have an "out of the box" Factor Analysis tool, but you could potentially build one using the R psych package.

 

As opposed to Factor Analysis, which focuses on accounting for just the correlations between the variables, PCA essentially tries to answer how to efficiently summarize the data in a smaller number of variables (with the majority of the variance in the first component, less in the 2nd component, and so on). So the components are created from ALL of the variables you selected for PCA, and the first few components will typically capture enough of the variance of all of the variables to run a valid analysis.
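To make that concrete, here is a minimal sketch in Python/numpy (for illustration only; Alteryx's tool runs on R) showing that the components are built from all of the input columns and that the proportion of variance they capture is largest for PC1 and shrinks from there. The data below is synthetic.

```python
import numpy as np

rng = np.random.default_rng(0)
# 200 observations of 5 predictors, some of them strongly correlated
base = rng.normal(size=(200, 2))
X = np.column_stack([
    base[:, 0],
    base[:, 0] + 0.1 * rng.normal(size=200),
    base[:, 1],
    base[:, 1] + 0.1 * rng.normal(size=200),
    rng.normal(size=200),
])

Xc = X - X.mean(axis=0)                  # center each column
_, s, _ = np.linalg.svd(Xc, full_matrices=False)
var_ratio = s**2 / np.sum(s**2)          # proportion of variance per component

# singular values come back sorted, so PC1 captures the most variance
assert np.all(np.diff(var_ratio) <= 0)
```

Because two pairs of columns are nearly duplicates, the first two components soak up most of the total variance here, which is exactly the "efficient summary" behavior described above.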

 

I believe Alteryx uses the prcomp function that is found in the "stats" R package. Information on this is here.

 

NM
7 - Meteor

Thanks. This is helpful.

 

After running PCA, I get component loadings (which I believe are the coefficients) and the values of the new latent variables (PC1, PC2, PC3, and so on).

 

1. How are these values of PC1, PC2, etc. calculated? Is there a standard method?

2. How do we decide the optimal number of PCs to create, i.e., how many new variables to keep?

NM
7 - Meteor

To give you better perspective, I have attached a sample data file with the base data, component loadings, and results.

 

I am curious to understand how PC1, PC2, PC3 values (in "PC Values" tab) are calculated.

 

Thanks!

NM
7 - Meteor

Another point: I ran linear and stepwise regression on the PC values obtained from PCA and found that there is a difference between the r-squared and adjusted r-squared values (with the adjusted value less than the actual value). What could explain this? (I'm asking because I understand that PCs are uncorrelated, and I expected this to remove any difference between the adjusted and actual r-squared values.)

RodL
Alteryx Alumni (Retired)

@NM,

From your first response back, there's probably still some confusion about what principal components are vs. what factors are. I say this because you mention PC1, PC2, etc. as "latent variables". Factor analysis provides the latent variables that might be the underlying cause of the variables you do a factor analysis on, but principal components merely capture the variance that is in the variables used in a PC analysis. So with PCA, there are no "hidden" variables. Each PC captures a certain amount of the variance of all of the variables, with PC1 capturing the most and working its way down.

 

For your first question, you would need to investigate further the properties of the "prcomp" function. The PCs are calculated within the algorithm I mentioned in my earlier post, and I wouldn't propose that I could even try to explain that process. 😉
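For what it's worth, the core of what prcomp does is well documented: it centers the data (and optionally scales it) and takes a singular value decomposition; each PC score is then just a weighted sum of the centered predictors, with the loadings as weights. A numpy sketch of that computation (illustrative, not Alteryx's actual code path):

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(50, 4))        # 50 rows, 4 predictor columns (synthetic)

# prcomp-style computation: center only (prcomp also scales if scale.=TRUE)
Xc = X - X.mean(axis=0)
U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
rotation = Vt.T                     # columns = loadings for PC1, PC2, ...
scores = Xc @ rotation              # the PC1, PC2, ... values for each row

# equivalently, the scores are the left singular vectors scaled by s
np.testing.assert_allclose(scores, U * s)
```

So the values in a "PC Values" tab are obtained by multiplying each centered row of the base data by the component-loading matrix.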

 

As to your second question, related to how many PCs you would want to include, look at the resulting report from the PCA tool. You are basically trying to get as much of the variance as possible included in the fewest number of PCs. So, looking at the Component Summary, the Cumulative Proportion accumulates the Proportion of Variance as more PCs are accounted for. When I was consulting, I would typically try to get at least 80% of the variance represented. Another rule-of-thumb method is to look at the Scree Plot included in the report and determine where it begins to "flatten out". That "elbow" is where you can expect to get the most variance with the least number of PCs.
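The 80%-of-variance rule of thumb is easy to apply mechanically to the Proportion of Variance column from a Component Summary. A short sketch, using hypothetical numbers (not from the attached file):

```python
import numpy as np

# hypothetical Proportion of Variance column from a Component Summary report
prop_var = np.array([0.46, 0.25, 0.12, 0.08, 0.05, 0.04])

cum = np.cumsum(prop_var)              # the Cumulative Proportion column
k = int(np.argmax(cum >= 0.80)) + 1    # smallest number of PCs reaching 80%
# cum = [0.46, 0.71, 0.83, ...] so here k == 3
```

With these numbers the first three PCs carry 83% of the variance, so the rule of thumb would say keep three.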

 

The thing to understand is that, as mentioned in the Help on the PCA tool, using PCA can be beneficial by reducing the number of variables, but at the cost of making those variables harder to interpret in terms of how they affect whatever model you are trying to build. I typically will use it on "related" variables.

 

So for example from your data, you have variables for a number of products. If Product A-E are products related to a specific product category (e.g., children's clothing), I would use PCA on just those 5 variables to possibly cut "children's clothing" down to 1 or 2 variables. This way though, when I use the one (or two) resulting PCs in my model, I can still explain the effect that "children's clothing" has on my model. Then if Products F-J are related (e.g., men's clothing), I would do the same thing and create PCs on just those variables. Then I would have 1 or 2 variables representing most of the variance of children's clothing and another 1 or 2 variables representing men's clothing.
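The per-category approach described above can be sketched as follows. The category names and data are hypothetical, and the helper below keeps only the first PC per group (you might keep two, per the 80% guideline):

```python
import numpy as np

rng = np.random.default_rng(2)
n = 100
childrens_clothing = rng.normal(size=(n, 5))   # Products A-E (hypothetical)
mens_clothing = rng.normal(size=(n, 5))        # Products F-J (hypothetical)

def first_pc(X):
    """Score of the first principal component of one variable group."""
    Xc = X - X.mean(axis=0)
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    return Xc @ Vt[0]

# one interpretable column per product category instead of five
reduced = np.column_stack([first_pc(childrens_clothing),
                           first_pc(mens_clothing)])
assert reduced.shape == (n, 2)
```

Each resulting column still maps to a named category, which is what preserves the interpretability RodL is describing.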

 

If, however, they are all product categories in themselves, I'm not sure I would use PCA, since what you would get is variance spread across multiple categories that would then be difficult to explain within your subsequent model. It will work, but it's just not as easy to operationalize a model if you can't explain where the effect of the product categories comes from. So I would instead probably try to determine which product categories have the greatest effect on your model, keep those, and eliminate the variables that have minimal significance.

 

Hope this makes sense.

 

As to your question on r-squared and adjusted r-squared, I'm not really qualified to respond to that, although my understanding is that the adjusted r-squared is typically less simply by the nature of the calculation. See r-squared/adj r-squared explanation.
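That understanding is right: the standard adjusted r-squared formula applies a penalty for the number of predictors p regardless of whether those predictors are correlated, so it sits below r-squared whenever p ≥ 1 and r-squared < 1. A small worked example (the r2, n, and p values are made up):

```python
def adjusted_r2(r2, n, p):
    """Standard adjusted R^2: 1 - (1 - R^2)(n - 1)/(n - p - 1)."""
    return 1 - (1 - r2) * (n - 1) / (n - p - 1)

# e.g. R^2 = 0.75 from a regression on 5 uncorrelated PCs, 100 observations
adj = adjusted_r2(0.75, n=100, p=5)
print(adj)  # ≈ 0.7367, still below 0.75 despite uncorrelated predictors
```

Using uncorrelated PCs avoids multicollinearity, but it doesn't remove the degrees-of-freedom penalty, which is why the gap NM observed is expected.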

 

 
