
Alteryx Designer Discussions

Find answers, ask questions, and share expertise about Alteryx Designer.

(Advice on) Use of PCA for Predictive Analysis

Asteroid

Dear Alteryx Community,

 

I have a few questions regarding the use of PCA for predictive analysis.

 

I have around 16 Xs (process parameters) which are leading to Y (product quality) for an industrial process.

 

Among the 16 Xs, 6 relate to the same piece of equipment, so I am considering PCA to reduce the computational load, since I expect these 6 to behave similarly.

Once I have run the PCA analysis, I get 6 principal components as follows:

 

[Image: screenshot.2019-03-18 (2).png]

 

My first conclusions are (and please correct me if I am wrong):

- The correlation between the Xs and PCs varies between -1 and 1

- PC1 is more related to the last 4 Xs (around 50% for each)

- PC2 is more related to "E7_Speed" (-0.92)

- By including PC1 to PC4 in my dataset instead of the 6 Xs, I still retain 95% of the variance.
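
A minimal sketch of how these conclusions could be checked in R, assuming a PCA fitted with prcomp() on centered and scaled inputs. The data frame and its column names (x1 to x6) are hypothetical stand-ins for the six equipment parameters. In prcomp(), the rotation matrix holds loadings (unit-length coefficient vectors), which are not the same thing as correlations; whether the screenshot shows one or the other is not confirmed here, so the sketch computes both.

```r
# Hypothetical stand-in for six equipment parameters; replace with real columns.
set.seed(42)
n <- 200
shared <- rnorm(n)  # common driver so that four columns are correlated
x <- data.frame(
  x1 = rnorm(n),
  x2 = rnorm(n),
  x3 = shared + rnorm(n, sd = 0.4),
  x4 = shared + rnorm(n, sd = 0.4),
  x5 = shared + rnorm(n, sd = 0.4),
  x6 = shared + rnorm(n, sd = 0.4)
)

pca <- prcomp(x, center = TRUE, scale. = TRUE)

# Loadings: each column is a unit-length vector of coefficients.
loadings <- pca$rotation

# Correlation between each original variable and each PC score.
# For a scaled PCA this equals loading * standard deviation of that PC.
var_pc_cor <- cor(x, pca$x)

# Share of variance per component and the running total.
var_share <- pca$sdev^2 / sum(pca$sdev^2)
cum_share <- cumsum(var_share)

# Smallest number of components that reaches 95% of the variance.
n_keep <- which(cum_share >= 0.95)[1]

print(round(loadings, 2))
print(round(cum_share, 3))
print(n_keep)
```

If four variables each show a loading near 0.5 on one component, their squared loadings sum to about 1, which is what a unit-length loading vector looks like.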

 

After running a predictive model with the 4 PCs and the 10 remaining Xs, I am getting a new model. What should be the right way to test it with a new dataset?

Is there a way to extract the PCA formula to apply it to the new dataset? (I suspect it would not make sense to run a second PCA on the new data)

 

Thanks a lot for your advice,

Pierre-Louis

Alteryx

You are correct that it does not make sense to run a second PCA for your second dataset.

 

R lets you apply a fitted PCA to a new dataset with the predict() function. Unfortunately, this isn't built directly into the GUI. I would recommend using an R code node to build your PCA and apply it to the new dataset.
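
To make the "PCA formula" concrete, here is a minimal sketch, assuming a PCA fitted with prcomp(). The data frames train and new and their columns a, b, c are made up for illustration. For a prcomp object, predict() centers and scales the new rows with the values stored from the training data, then multiplies by the rotation matrix.

```r
# Hypothetical training and new data with the same three columns.
set.seed(1)
train <- data.frame(a = rnorm(50, 10, 2), b = rnorm(50, 5, 1), c = rnorm(50))
train$c <- 0.5 * train$a + train$c  # make a and c correlated
new <- data.frame(a = rnorm(10, 10, 2), b = rnorm(10, 5, 1), c = rnorm(10))

# Fit the PCA on the training data only.
pca <- prcomp(train, center = TRUE, scale. = TRUE)

# Option 1: let predict() apply the stored transformation.
scores_predict <- predict(pca, newdata = new)

# Option 2: apply the same transformation by hand.
# pca$center and pca$scale are the training means and standard deviations;
# pca$rotation holds the loadings.
scores_manual <- scale(as.matrix(new), center = pca$center, scale = pca$scale) %*%
  pca$rotation

# Both routes should give the same scores.
print(all.equal(scores_predict, scores_manual, check.attributes = FALSE))
```

The new data never influences the means, standard deviations, or loadings, which is why a second PCA on the new data is not needed.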

Asteroid

Thanks @AndrewKramer for your confirmation.

Unfortunately, using an R code node to extract the relevant information is far beyond my skills... so far... but I will explore this option :-)

Alteryx

This link provides a pretty good example of how to do it. Let me know if you have any questions.

 

https://stats.stackexchange.com/questions/72839/how-to-use-r-prcomp-results-for-prediction

 

 

Asteroid

Hi @AndrewKramer !

I have started to learn Python, as my 2000s programming skills (PHP mostly!) are obsolete!

Unfortunately, I cannot build on the example you have provided.

Would you be so kind as to guide me using the example I have designed?

Alteryx

I was able to build a quick example similar to yours. I used the R code tool to construct the PCA model, then got the PCA values for the training and validation datasets. I was then able to build my Gradient Boosting model on the training data and score the validation data.

[Image: pca.PNG]

Here is my code in the R tool. Be sure to use read.Alteryx and write.Alteryx to pass data to and from Alteryx.

#Load Data
#Input #1 is the original (training) data, input #2 is the new data
data <- read.Alteryx("#1", mode="data.frame")
new_data <- read.Alteryx("#2", mode="data.frame")

#Columns that go into the PCA
pca_cols <- c('age','balance','duration','previous','campaign')

#Run PCA on original data
pca <- prcomp(data[pca_cols], retx=TRUE, center=TRUE, scale.=TRUE)

#Score PCA on original data
#Select 4 Principal Components
pca_data <- data.frame(predict(pca, data[pca_cols])[,1:4])
pca_data$y <- data$y

#Score PCA on new data, reusing the PCA fitted on the original data
#Select 4 Principal Components
pca_new_data <- data.frame(predict(pca, new_data[pca_cols])[,1:4])
pca_new_data$y <- new_data$y

#Write to Alteryx
write.Alteryx(pca_data, 1)
write.Alteryx(pca_new_data, 2)

Let me know if you have questions.

Asteroid

Hi @AndrewKramer !

Victory! I have been able to adapt your code, calculate the PCA parameters on the first dataset and apply them to a second one:

 

#Load Data
data <- read.Alteryx("#1", mode="data.frame")
new_data <- read.Alteryx("#2", mode="data.frame")

#Run PCA on original data
pca <- prcomp(data[,3:8], retx=TRUE, center=TRUE, scale.=TRUE)

#Score PCA on original data
#Select 4 Principal Components
pca_data <- data.frame(predict(pca, data[,3:8])[,1:4])
pca_data$y <- data$y

#Score PCA on new data
#Select 4 Principal Components
pca_new_data <- data.frame(predict(pca, new_data[,3:8])[,1:4])
pca_new_data$y <- new_data$y

#Write to Alteryx
write.Alteryx(pca_data, 1)
write.Alteryx(pca_new_data, 2)

Thanks a lot for guiding me through the whole process, I owe you one!

Pierre-Louis
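
One possible refinement, sketched under assumptions: selecting the PCA inputs by name rather than by position (3:8) guards against the two datasets having columns in a different order, and binding the scores back to the untouched columns keeps the remaining Xs alongside the PCs for the predictive model. All column names below (id, p1 to p3, other_x, y) are hypothetical, and plain data frames stand in for the read.Alteryx inputs.

```r
# Hypothetical training data; in Alteryx this would come from read.Alteryx("#1").
set.seed(7)
make_data <- function(n) {
  p1 <- rnorm(n)
  data.frame(
    id = seq_len(n),
    p1 = p1,
    p2 = p1 + rnorm(n, sd = 0.3),
    p3 = rnorm(n),
    other_x = runif(n),
    y = rnorm(n)
  )
}
data <- make_data(40)
# New data with the same columns in a different order.
new_data <- make_data(15)[, c("y", "p3", "other_x", "p1", "id", "p2")]

pca_cols <- c("p1", "p2", "p3")  # columns that feed the PCA
n_keep <- 2                      # number of principal components to keep

# Fail early if the new data lacks any PCA input column.
stopifnot(all(pca_cols %in% names(new_data)))

# Fit on the training data only.
pca <- prcomp(data[pca_cols], center = TRUE, scale. = TRUE)

# Replace the PCA input columns with the first n_keep component scores.
score_with_pca <- function(df) {
  scores <- predict(pca, df[pca_cols])[, seq_len(n_keep), drop = FALSE]
  cbind(df[setdiff(names(df), pca_cols)], as.data.frame(scores))
}

pca_data <- score_with_pca(data)
pca_new_data <- score_with_pca(new_data)

print(names(pca_data))
print(names(pca_new_data))
```

Inside the R tool, the two resulting data frames could then be passed on with write.Alteryx as in the code above.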

Alteryx

Glad to help
