PCA is a critical method for dimension reduction of high-dimensional data. High-dimensional data are data with many more features (p) than observations (n). However, this is changing with single-cell RNAseq data: we can now sequence millions of single cells (n), and each cell has ~20,000 genes/features (p).
I suggest you read my previous blog post on using svd
to calculate PCs.
Single-cell expression data PCA
In single-cell RNAseq analysis, feature selection is usually performed first. In Seurat
, for example, the most variable genes are identified by FindVariableGenes
and used for downstream analysis. The number of variable genes is typically in
the range of a couple of thousand (~2000), which further reduces the number of features (p).
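As a reminder of what that step looks like, here is a minimal sketch using the Seurat v2 style API; the object name pbmc and the cutoff values are illustrative, not from this post:
library(Seurat)  # Seurat v2 style API assumed

# 'pbmc' is a hypothetical, already-normalized Seurat object
pbmc <- FindVariableGenes(object = pbmc,
                          mean.function = ExpMean,
                          dispersion.function = LogVMR,
                          x.low.cutoff = 0.0125,
                          x.high.cutoff = 3,
                          y.cutoff = 0.5)
# number of variable genes selected; often on the order of ~2000, depending on the cutoffs
length(pbmc@var.genes)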
Let’s take a look at the source code of Seurat
for
PCA:
if (rev.pca) {
  pcs.compute <- min(pcs.compute, ncol(x = data.use) - 1)
  pca.results <- irlba(A = data.use, nv = pcs.compute, ...)
  sdev <- pca.results$d/sqrt(max(1, nrow(data.use) - 1))
  if (weight.by.var) {
    gene.loadings <- pca.results$u %*% diag(pca.results$d)
  } else {
    gene.loadings <- pca.results$u
  }
  cell.embeddings <- pca.results$v
} else {
  pcs.compute <- min(pcs.compute, nrow(x = data.use) - 1)
  pca.results <- irlba(A = t(x = data.use), nv = pcs.compute, ...)
  gene.loadings <- pca.results$v
  sdev <- pca.results$d/sqrt(max(1, ncol(data.use) - 1))
  if (weight.by.var) {
    cell.embeddings <- pca.results$u %*% diag(pca.results$d)
  } else {
    cell.embeddings <- pca.results$u
  }
}
rownames(x = gene.loadings) <- rownames(x = data.use)
colnames(x = gene.loadings) <- paste0(reduction.key, 1:pcs.compute)
rownames(x = cell.embeddings) <- colnames(x = data.use)
colnames(x = cell.embeddings) <- colnames(x = gene.loadings)
and the help page for {Seurat::RunPCA}:
pc.genes
Genes to use as input for PCA. Default is object@var.genes
rev.pca
By default computes the PCA on the cell x gene matrix. Setting to true will compute it on gene x cell matrix.
Seurat
uses irlba (Fast Truncated Singular Value Decomposition and Principal Components Analysis for Large Dense and Sparse Matrices) for PCA.
The irlba package is both faster and more memory-efficient than the usual R svd
function for computing a few of the largest singular vectors and the corresponding singular values of a matrix.
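A quick way to see the difference is a small sketch on a simulated dense matrix; the matrix size and nv = 20 are arbitrary choices of mine:
library(irlba)

set.seed(42)
m <- matrix(rnorm(2000 * 500), nrow = 2000, ncol = 500)

system.time(s_trunc <- irlba(m, nv = 20))  # only the top 20 singular triplets
system.time(s_full  <- svd(m))             # full SVD of all 500 components

# the leading singular values should agree up to irlba's convergence tolerance
all.equal(s_trunc$d, s_full$d[1:20], tolerance = 1e-5)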
By default, RunPCA
computes the PCA on the cell (n) x gene (p)
matrix.
One thing to note is that, by convention, a data matrix is arranged as n x p (rows are observations, columns are features). That is why, by default, the original gene x cell
matrix is transposed first to cell x gene
: irlba(A = t(x = data.use), nv = pcs.compute, ...)
.
After irlba
on the transposed matrix, the v
matrix holds the gene loadings and the u
matrix holds the cell embeddings.
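To make the bookkeeping concrete, here is a small sketch on a simulated gene x cell matrix. All object names below are mine; the steps mirror the default rev.pca = FALSE, weight.by.var = TRUE branch above, assuming the matrix has already been scaled as Seurat's ScaleData would do:
library(irlba)

set.seed(1)
# toy gene x cell matrix: 100 "genes" x 20 "cells"
mat <- matrix(rnorm(100 * 20), nrow = 100,
              dimnames = list(paste0("gene", 1:100), paste0("cell", 1:20)))

# center/scale each gene across cells (Seurat scales the data before RunPCA)
mat_scaled <- t(scale(t(mat)))

pcs <- 5
# default path: truncated SVD of the transposed, cell x gene matrix
svd_out <- irlba(A = t(mat_scaled), nv = pcs)

cell.embeddings <- svd_out$u %*% diag(svd_out$d)          # cells x PCs
gene.loadings   <- svd_out$v                              # genes x PCs
sdev            <- svd_out$d / sqrt(ncol(mat_scaled) - 1) # standard deviations

# cross-check against prcomp on the same (already centered) cell x gene matrix;
# singular vectors are only defined up to sign, hence the abs();
# both comparisons should agree up to numerical tolerance
pr <- prcomp(t(mat_scaled), center = FALSE)
all.equal(abs(cell.embeddings), abs(pr$x[, 1:pcs]),
          check.attributes = FALSE, tolerance = 1e-5)
all.equal(sdev, pr$sdev[1:pcs], tolerance = 1e-5)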
number of significant PCs
For downstream analysis, e.g. {Seurat::FindClusters}
, only the PCs that contribute significantly to the variation of the data are used. Seurat
uses the JackStraw
and JackStrawPlot
functions to achieve this.
JackStraw
:
Randomly permutes a subset of data, and calculates projected PCA scores for these ‘random’ genes. Then compares the PCA scores for the ‘random’ genes with the observed PCA scores to determine statistical significance. End result is a p-value for each gene’s association with each principal component.
JackStrawPlot
:
Plots the results of the JackStraw analysis for PCA significance. For each PC, plots a QQ-plot comparing the distribution of p-values for all genes across each PC, compared with a uniform distribution. Also determines a p-value for the overall significance of each PC. The p-value for each PC is based on a proportion test comparing the number of genes with a p-value below a particular threshold (score.thresh), compared with the proportion of genes expected under a uniform distribution of p-values.
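To see where these p-values show up in a workflow, here is a minimal usage sketch; Seurat v2 style API assumed, and the object name pbmc and the argument values are illustrative:
# permutation-based significance test for each PC (Seurat v2 API assumed)
pbmc <- JackStraw(object = pbmc, num.replicate = 100)
# QQ-plots of gene p-values per PC, plus an overall p-value for each PC
JackStrawPlot(object = pbmc, PCs = 1:12)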
The other day, I saw a tweet on permuting the original matrix to assess the significance of the PCs. I forget the original tweet, but this is from a retweet: https://twitter.com/MattOldach/status/1075037756563382274
References: this approach is called Horn’s Parallel Analysis (original paper https://link.springer.com/article/10.1007%2FBF02289447 and a modification https://journals.sagepub.com/doi/abs/10.1177/0013164495055003002?journalCode=epma). It is a great method for removing noisy components.
This is not exactly the same as what Seurat
is doing, but the idea is similar.
I want to put it down here for my future reference.
permutation “test” for PCA
The code below is copied from that tweet; credit goes to the author.
pca_eigenperm <- function(data, nperm = 1000){
  # PCA on the observed data (used below for the PC names)
  pca_out <- prcomp(data, scale. = T)
  # one row per permutation, one column per PC (eigenvalues of permuted data)
  eigenperm <- data.frame(matrix(NA, nperm, ncol(data)))
  n <- ncol(data)
  data_i <- data.frame(matrix(NA, nrow(data), ncol(data)))
  for (j in 1:nperm){
    # permute each column (feature) independently to break the correlations
    for (i in 1:n){
      data_i[, i] <- sample(data[, i], replace = F)
    }
    pca.perm <- prcomp(data_i, scale. = T)
    eigenperm[j, ] <- pca.perm$sdev^2
  }
  colnames(eigenperm) <- colnames(pca_out$rotation)
  eigenperm
}
We will use the same NCI60
data set for demonstration.
library(tidyverse)
## ── Attaching packages ──────────────────────────────────────────────────────────────── tidyverse 1.2.1 ──
## ✔ ggplot2 3.1.0 ✔ purrr 0.2.5
## ✔ tibble 1.4.2 ✔ dplyr 0.7.8
## ✔ tidyr 0.8.2 ✔ stringr 1.3.1
## ✔ readr 1.3.1 ✔ forcats 0.3.0
## ── Conflicts ─────────────────────────────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
library(ggplot2)
library(ISLR)
ncidat<- NCI60$data
rownames(ncidat)<- NCI60$labs
dim(ncidat)
## [1] 64 6830
fa_pca_perm <- pca_eigenperm(t(ncidat))
fa_pca <- prcomp(t(ncidat))
fa_pca_rand95 <-
  data.frame(Random_Eigenvalues = sapply(fa_pca_perm, quantile, 0.95)) %>%
  # 95th percentile of the random eigenvalues
  mutate(PC = colnames(fa_pca$rotation)) %>%
  # add PC IDs as a discrete variable
  cbind(Eigenvalues = fa_pca$sdev^2)
  # combine rand95 with the real eigenvalues

## only the first 9 PCs
fa_pca_rand95_long <-
  gather(fa_pca_rand95[1:9, ], key = Variable, value = Value, -PC)

ggplot(fa_pca_rand95_long, aes(PC, Value, fill = Variable)) +
  geom_bar(stat = "identity", position = position_dodge()) +
  labs(y = "Eigenvalue", x = "", fill = "") +
  theme_classic()
We see that after PC6, the eigenvalues of the real data are almost the same as those of the permuted data.
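As a quick numeric check of the same plot, one can count how many leading PCs have an observed eigenvalue above the 95th percentile of the permuted eigenvalues; this is a small sketch using the fa_pca_rand95 data frame built above, and n_keep is my own variable name:
# index of the first PC that does NOT beat the permutation threshold, minus one
n_keep <- which(fa_pca_rand95$Eigenvalues <= fa_pca_rand95$Random_Eigenvalues)[1] - 1
n_keep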
For single-cell data, permutation can take a long time. That is why JackStraw
has an
option prop.freq
(the proportion of the data to randomly permute for each replicate) to
permute only a subset of the data matrix.
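For example, a sketch with the same hypothetical pbmc object and Seurat v2 style API as above; 0.025 is an arbitrary illustrative value:
# permute only 2.5% of the genes in each replicate to reduce run time
pbmc <- JackStraw(object = pbmc, num.replicate = 100, prop.freq = 0.025)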