marker gene selection using logistic regression and regularization for scRNAseq

May 15, 2022 11 min read bioinformatics, single-cell, machine learning

why this blog post?

I saw a biorxiv paper titled A comparison of marker gene selection methods for single-cell RNA sequencing data

Our results highlight the efficacy of simple methods, especially the Wilcoxon rank-sum test, Student’s t-test and logistic regression

I am interested in using logistic regression to find marker genes and want to try fitting the model in the tidymodel ecosystem and using different regularization methods.

I highly recommend you to watch the statquest videos on those topics.

Let’s use PBMC single-cell RNAseq data as an example.

Load libraries

library(Seurat)
library(tidyverse)
library(tidymodels)
library(scCustomize) # for plotting
library(patchwork)

Preprocess the data

# Load the PBMC dataset
pbmc.data <- Read10X(data.dir = "~/blog_data/filtered_gene_bc_matrices/hg19/")
# Initialize the Seurat object with the raw (non-normalized data).
pbmc <- CreateSeuratObject(counts = pbmc.data, project = "pbmc3k", min.cells = 3, min.features = 200)

## routine processing
pbmc<- pbmc %>% 
  NormalizeData(normalization.method = "LogNormalize", scale.factor = 10000) %>%
  FindVariableFeatures(selection.method = "vst", nfeatures = 2000) %>%
  ScaleData() %>%
  RunPCA(verbose = FALSE) %>%
  FindNeighbors(dims = 1:10, verbose = FALSE) %>%
  FindClusters(resolution = 0.5, verbose = FALSE) %>%
  RunUMAP(dims = 1:10, verbose = FALSE)

## the annotation borrowed from Seurat tutorial
new.cluster.ids <- c("Naive CD4 T", "CD14+ Mono", "Memory CD4 T", "B", "CD8 T", "FCGR3A+ Mono", 
    "NK", "DC", "Platelet")
names(new.cluster.ids) <- levels(pbmc)
pbmc <- RenameIdents(pbmc, new.cluster.ids)
pbmc$cell_type<- Idents(pbmc)
DimPlot(pbmc)

Marker gene detection using differential expression analysis between clusters.

pbmc_subset<- pbmc[, pbmc$cell_type %in% c("NK", "B")]
DimPlot(pbmc_subset)

Idents(pbmc_subset)<- pbmc_subset$cell_type

diff_markers<- FindMarkers(pbmc_subset, ident.1 = "B", ident.2 = "NK", test.use = "wilcox")

diff_markers %>%
        arrange(p_val_adj, desc(abs(avg_log2FC))) %>%
        head(n = 20)

#>                 p_val avg_log2FC pct.1 pct.2    p_val_adj
#> GZMB     7.646318e-98  -5.553778 0.017 0.966 1.048616e-93
#> GZMA     2.929923e-95  -4.519094 0.011 0.939 4.018097e-91
#> PRF1     9.684091e-95  -4.635264 0.029 0.959 1.328076e-90
#> NKG7     8.032681e-92  -6.310372 0.083 0.986 1.101602e-87
#> CST7     4.498511e-91  -4.389481 0.037 0.946 6.169258e-87
#> CTSW     1.747566e-89  -4.144031 0.066 0.966 2.396612e-85
#> FGFBP2   3.777986e-87  -4.555937 0.034 0.912 5.181130e-83
#> GNLY     8.666212e-85  -6.080298 0.080 0.939 1.188484e-80
#> FCGR3A   1.278458e-79  -3.635596 0.034 0.858 1.753278e-75
#> CD247    1.428149e-77  -3.763002 0.040 0.851 1.958563e-73
#> GZMM     3.440488e-77  -3.248116 0.017 0.818 4.718285e-73
#> CD7      4.578399e-74  -3.330969 0.054 0.851 6.278817e-70
#> TYROBP   2.340536e-73  -3.560921 0.103 0.899 3.209811e-69
#> FCER1G   9.343826e-72  -3.499007 0.069 0.845 1.281412e-67
#> SPON2    7.917147e-70  -3.506623 0.011 0.743 1.085758e-65
#> CD74     1.880522e-69   3.659825 1.000 0.824 2.578947e-65
#> HLA-DRA  2.083409e-69   4.597057 1.000 0.365 2.857187e-65
#> SRGN     1.501958e-68  -3.151188 0.201 0.932 2.059785e-64
#> HLA-DPB1 3.257314e-66   3.595454 0.986 0.345 4.467080e-62
#> HLA-DRB1 2.058185e-65   3.710470 0.980 0.189 2.822594e-61

Note that p-values from this type of analysis are inflated as we clustered the cells first and then find the differences between the clusters. In other words, we double-dip the data.

see slide 27 from my ABRF2022 talk https://divingintogeneticsandgenomics.rbind.io/files/slides/2022-03-29_single-cell-101.pdf

Let’s use logistic regression to find marker genes

We can use logistic regression to classify B-cells and NK cells.

## re-process the data, as the most-variable genes will change when you only have NK and B cells vs all cells. use log normalized data
pbmc_subset<- pbmc_subset %>% 
  NormalizeData(normalization.method = "LogNormalize", scale.factor = 10000) %>%
  FindVariableFeatures(selection.method = "vst", nfeatures = 2000) %>%
  ScaleData() %>%
  RunPCA(verbose = FALSE) %>%
  FindNeighbors(dims = 1:10, verbose = FALSE) %>%
  FindClusters(resolution = 0.1, verbose = FALSE) %>%
  RunUMAP(dims = 1:10, verbose = FALSE)

we use scaled data, because it was z-score transformed, so the highly expressed genes will not dominate the model

data<- pbmc_subset@assays$RNA@scale.data
# let's transpose the matrix and make it to a dataframe
dim(data)

#> [1] 2000  497

data<- t(data) %>%
        as.data.frame()

# now, every row is a cell with 2000 genes/features/predictors. 
dim(data)

#> [1]  497 2000

## add the cell type/the outcome/y to the dataframe
data$cell_type<- pbmc_subset$cell_type
data$cell_barcode<- rownames(data)
## it is important to turn it to a factor for classification
data$cell_type<- factor(data$cell_type)

prepare the data

set.seed(123)

data_split <- initial_split(data, strata = "cell_type")
data_train <- training(data_split)
data_test <- testing(data_split)

# 10 fold cross validation
data_fold <- vfold_cv(data_train, v = 10)

We will not look at the test data at all until we test our model.

Ridge Regression

In our training data, we have 2000 genes/features (p) and 273 cells/observations (n) and p >> n, so we will need to enforce sparsity of the model by regularization.

We’ll set the penalty argument to tune() as a placeholder for now. This is a model hyper parameter that we will tune to find the best value for making predictions with our data. Setting mixture to a value of one means that the glmnet model will potentially remove irrelevant predictors and choose a simpler model.

Let’s use L2 regularization/norm for ridge regression by specifying mixture = 0. and we need to tune the penalty hyper parameter.

ridge_spec<-
        logistic_reg(penalty = tune(), mixture = 0) %>%
        set_engine("glmnet") %>%
        set_mode("classification")

ridge_recipe <- 
  recipe(formula = cell_type ~ ., data = data_train) %>% 
  update_role(cell_barcode, new_role = "ID")  %>%
  step_zv(all_predictors())

# step_normalize(all_predictors())
## the expression is already scaled, no need to do step_normalize

ridge_workflow <- workflow() %>% 
  add_recipe(ridge_recipe) %>% 
  add_model(ridge_spec)

tune the penalty hyper-parameter.

penalty_grid <- grid_regular(penalty(range = c(-5, 5)), levels = 50)
#penalty_grid
tune_res <- tune_grid(
  ridge_workflow,
  resamples = data_fold, 
  grid = penalty_grid
)

# tune_res
# check which penalty is the best 
autoplot(tune_res)

best_penalty <- select_best(tune_res, metric = "accuracy")
best_penalty

#> # A tibble: 1 × 2
#>   penalty .config              
#>     <dbl> <chr>                
#> 1 0.00001 Preprocessor1_Model01

ridge_final <- finalize_workflow(ridge_workflow, best_penalty)
ridge_final_fit <- fit(ridge_final, data = data_train)

tidy(ridge_final_fit) %>% arrange(desc(abs(estimate)))

#> # A tibble: 2,000 × 3
#>    term        estimate penalty
#>    <chr>          <dbl>   <dbl>
#>  1 (Intercept)  -1.21   0.00001
#>  2 NKG7          0.0318 0.00001
#>  3 GZMB          0.0307 0.00001
#>  4 HLA-DRA      -0.0307 0.00001
#>  5 CTSW          0.0305 0.00001
#>  6 GNLY          0.0297 0.00001
#>  7 PRF1          0.0296 0.00001
#>  8 GZMA          0.0295 0.00001
#>  9 CD74         -0.0295 0.00001
#> 10 FGFBP2        0.0295 0.00001
#> # … with 1,990 more rows

# augment(ridge_final_fit, new_data = data_test)

we see the NKG7, GZMB, CTSW GNLY, PRF1 etc on the top as the marker genes for NK cells which makes sense!

## confusion matrix
predict(ridge_final_fit, new_data = data_test) %>%
        bind_cols(data_test %>% select(cell_type)) %>%
        conf_mat(truth = cell_type, estimate = .pred_class)

#>           Truth
#> Prediction  B NK
#>         B  88  0
#>         NK  0 37

Lasso regression

You need to use logistic_reg() and set mixture = 1 to specify a lasso model. The mixture argument specifies the amount of different types of regularization, mixture = 0 specifies only ridge regularization and mixture = 1 specifies only lasso regularization.

Lasso will remove highly correlated features which may not be what we want here!

lasso_spec<-
        logistic_reg(penalty = tune(), mixture = 1) %>%
        set_engine("glmnet") %>%
        set_mode("classification")

lasso_recipe <- 
  recipe(formula = cell_type ~ ., data = data_train) %>% 
  update_role(cell_barcode, new_role = "ID")  %>%
  step_zv(all_predictors())

# step_normalize(all_predictors())
## the expression is already scaled, no need to do step_normalize

lasso_workflow <- workflow() %>% 
  add_recipe(lasso_recipe) %>% 
  add_model(lasso_spec)

penalty_grid <- grid_regular(penalty(range = c(-5, 5)), levels = 50)
#penalty_grid
tune_res <- tune_grid(
  lasso_workflow,
  resamples = data_fold, 
  grid = penalty_grid
)

# tune_res
# check which penalty is the best 
autoplot(tune_res)

best_penalty <- select_best(tune_res, metric = "accuracy")
best_penalty

#> # A tibble: 1 × 2
#>   penalty .config              
#>     <dbl> <chr>                
#> 1 0.00001 Preprocessor1_Model01

lasso_final <- finalize_workflow(lasso_workflow, best_penalty)
lasso_final_fit <- fit(lasso_final, data = data_train)

## confusion matrix
predict(lasso_final_fit, new_data = data_test) %>%
        bind_cols(data_test %>% select(cell_type)) %>%
        conf_mat(truth = cell_type, estimate = .pred_class)

#>           Truth
#> Prediction  B NK
#>         B  88  0
#>         NK  0 37

lasso_features<- tidy(lasso_final_fit) %>% 
        arrange(desc(abs(estimate))) %>%
        filter(estimate != 0) 

print(lasso_features, n = Inf)

#> # A tibble: 21 × 3
#>    term        estimate penalty
#>    <chr>          <dbl>   <dbl>
#>  1 (Intercept) -2.48    0.00001
#>  2 NKG7         1.70    0.00001
#>  3 CTSW         0.582   0.00001
#>  4 GZMA         0.561   0.00001
#>  5 HLA-DRA     -0.561   0.00001
#>  6 CD74        -0.485   0.00001
#>  7 CD79A       -0.423   0.00001
#>  8 CDC6         0.422   0.00001
#>  9 CDKN3        0.251   0.00001
#> 10 GNLY         0.245   0.00001
#> 11 CST7         0.195   0.00001
#> 12 GZMB         0.174   0.00001
#> 13 TYMS         0.155   0.00001
#> 14 PRF1         0.151   0.00001
#> 15 RRM2         0.142   0.00001
#> 16 KIFC1        0.0690  0.00001
#> 17 LTB         -0.0612  0.00001
#> 18 FGFBP2       0.0588  0.00001
#> 19 MGLL         0.0577  0.00001
#> 20 RPL13A      -0.0456  0.00001
#> 21 CDCA8        0.00842 0.00001

Because Lasso regression enforce sparsity, we can select the features with the coefficients/estimates not equal to 0, but we will miss the correlated genes which were thrown out by the model.

Also, if you are wondering how can I get a p-value for the coefficient, read https://stats.stackexchange.com/questions/410173/lasso-regression-p-values-and-coefficients

Elastic-net

Setting mixture to a value between 0 and 1 lets us use both Lasso and Ridge (Elastic-net!!)

Elastic-net gets the best world of both Lasso and Ridge. Thanks, Matt Bernstein for pointing the original elastic-net paper to me!

https://hastie.su.domains/Papers/elasticnet.pdf

In addition, the elastic net encourages a grouping effect, where strongly correlated predictors tend to be in (out) the model together. The elastic net is particularly useful when the number of predictors (p) is much bigger than the number of observations (n)

elastic_recipe <- 
  recipe(formula = cell_type ~ ., data = data_train) %>% 
  update_role(cell_barcode, new_role = "ID") %>%
  step_zv(all_predictors())

# we will tune both penalty and the mixture
elastic_spec <- 
  logistic_reg(penalty = tune(), mixture = tune()) %>% 
  set_engine("glmnet") %>%
  set_mode("classification")

elastic_workflow <- workflow() %>% 
  add_recipe(elastic_recipe) %>% 
  add_model(elastic_spec)

## tune both the penalty and the mixture from 0-1
penalty_grid <- grid_regular(penalty(range = c(-2, 2)), mixture(), levels = 50)

doParallel::registerDoParallel()
tune_res <- tune_grid(
  elastic_workflow,
  resamples = data_fold, 
  grid = penalty_grid
)

autoplot(tune_res)

best_penalty <- select_best(tune_res, metric = "accuracy")
elastic_final <- finalize_workflow(elastic_workflow, best_penalty)
elastic_final_fit <- fit(elastic_final, data = data_train)

## confusion matrix
predict(elastic_final_fit, new_data = data_test) %>%
        bind_cols(data_test %>% select(cell_type)) %>%
        conf_mat(truth = cell_type, estimate = .pred_class)

#>           Truth
#> Prediction  B NK
#>         B  88  0
#>         NK  0 37

elastic_features<- tidy(elastic_final_fit) %>%
  arrange(desc(abs(estimate))) %>%
  filter(estimate != 0) 

head(elastic_features, n = 20)

#> # A tibble: 20 × 3
#>    term        estimate penalty
#>    <chr>          <dbl>   <dbl>
#>  1 (Intercept)  -1.84      0.01
#>  2 NKG7          0.105     0.01
#>  3 CTSW          0.100     0.01
#>  4 HLA-DRA      -0.0993    0.01
#>  5 GZMB          0.0971    0.01
#>  6 GNLY          0.0971    0.01
#>  7 CD74         -0.0949    0.01
#>  8 GZMA          0.0940    0.01
#>  9 FGFBP2        0.0932    0.01
#> 10 CST7          0.0929    0.01
#> 11 PRF1          0.0922    0.01
#> 12 CD79A        -0.0901    0.01
#> 13 GZMM          0.0867    0.01
#> 14 LTB          -0.0838    0.01
#> 15 LRRIQ3        0.0808    0.01
#> 16 CD247         0.0792    0.01
#> 17 TYROBP        0.0778    0.01
#> 18 CD7           0.0766    0.01
#> 19 FCGR3A        0.0765    0.01
#> 20 FCER1G        0.0712    0.01

head(lasso_features, n = 20)

#> # A tibble: 20 × 3
#>    term        estimate penalty
#>    <chr>          <dbl>   <dbl>
#>  1 (Intercept)  -2.48   0.00001
#>  2 NKG7          1.70   0.00001
#>  3 CTSW          0.582  0.00001
#>  4 GZMA          0.561  0.00001
#>  5 HLA-DRA      -0.561  0.00001
#>  6 CD74         -0.485  0.00001
#>  7 CD79A        -0.423  0.00001
#>  8 CDC6          0.422  0.00001
#>  9 CDKN3         0.251  0.00001
#> 10 GNLY          0.245  0.00001
#> 11 CST7          0.195  0.00001
#> 12 GZMB          0.174  0.00001
#> 13 TYMS          0.155  0.00001
#> 14 PRF1          0.151  0.00001
#> 15 RRM2          0.142  0.00001
#> 16 KIFC1         0.0690 0.00001
#> 17 LTB          -0.0612 0.00001
#> 18 FGFBP2        0.0588 0.00001
#> 19 MGLL          0.0577 0.00001
#> 20 RPL13A       -0.0456 0.00001

merged_markers<- left_join(elastic_features, lasso_features, by = c("term" = "term")) %>%
        dplyr::rename(estimate.elastic = estimate.x, estimate.lasso= estimate.y) %>%
        select(-penalty.x, - penalty.y) 

head(merged_markers, n = 20)

#> # A tibble: 20 × 3
#>    term        estimate.elastic estimate.lasso
#>    <chr>                  <dbl>          <dbl>
#>  1 (Intercept)          -1.84          -2.48  
#>  2 NKG7                  0.105          1.70  
#>  3 CTSW                  0.100          0.582 
#>  4 HLA-DRA              -0.0993        -0.561 
#>  5 GZMB                  0.0971         0.174 
#>  6 GNLY                  0.0971         0.245 
#>  7 CD74                 -0.0949        -0.485 
#>  8 GZMA                  0.0940         0.561 
#>  9 FGFBP2                0.0932         0.0588
#> 10 CST7                  0.0929         0.195 
#> 11 PRF1                  0.0922         0.151 
#> 12 CD79A                -0.0901        -0.423 
#> 13 GZMM                  0.0867        NA     
#> 14 LTB                  -0.0838        -0.0612
#> 15 LRRIQ3                0.0808        NA     
#> 16 CD247                 0.0792        NA     
#> 17 TYROBP                0.0778        NA     
#> 18 CD7                   0.0766        NA     
#> 19 FCGR3A                0.0765        NA     
#> 20 FCER1G                0.0712        NA

we see GZMM is missed by Lasso but picked up by elastic-net. GZMM is probably highly correlated with other granzymes such as GZMB, GZMA which are already in the model.

p1<- ggplot(data_train, aes(x = GZMM, y = GZMB)) +
        geom_point() +
        geom_smooth(method = "lm", se = T) +
        ggpubr::stat_cor() +
        theme_bw(base_size = 14) +
        ggtitle("correlation of GZMB and GZMM")

p2<- ggplot(data_train, aes(x = GZMA, y = GZMB)) +
        geom_point() +
        geom_smooth(method = "lm", se = T) +
        ggpubr::stat_cor() +
        theme_bw(base_size = 14) +
        ggtitle("correlation of GZMA and GZMB")

p1 + p2

FCRR3A (CD16) picked up as an NK cell marker by elastic too.

From this paper. CD16 is a pretty important marker for NK cell subytyping. It is sad that Lasso missed it.

Let’s plot the gene expression level to check the raw data.

Idents(pbmc_subset)<- pbmc_subset$cell_type
scCustomize::Stacked_VlnPlot(pbmc_subset, features = merged_markers %>% slice(-1) %>% pull(term) %>% head(n = 20),
                             colors_use = c("blue", "red") )

Many of the top genes picked up by elastic-net are very convincing to me. LRRIQ3 is low in both cell types.

Things learned

the tidymodel ecosystem makes the interface to different machine learning methods consistent. You can tell how similar the codes are for all three regularization methods (Time to write a function!).
classification for B cells and NK cells are petty easy task as they are very different. All models perform well. Again, the purpose here is not to use the model to do prediction, but use the model to do feature selection.
Seurat has an implementation of the logistic regression for marker gene selection: glm(formula = fmla, data = model.data, family = "binomial")

UPDATE

I did not see MS4A1/CD20in any of the regression results which is a well know B-cell marker. I then went back and found that somehow MS4A1 is not in the top 2000 most variable genes selected by Seurat in the subseted small object. This tells us feature engineer is very important too.

single-cell bioinformatics machine learning