I am interested in learning more on matrix factorization and its application in scRNAseq data. I want to shout out to this paper: Enter the Matrix: Factorization Uncovers Knowledge from Omics by Elana J. Fertig group.
A matrix is decomposed to two matrices: the amplitude matrix and the pattern matrix. You can then do all sorts of things with the decomposed matrices. Single cell matrix is no special, one can use the matrix factorization techniques to derive interesting biological insights.
Dotplot is a nice way to visualize scRNAseq expression data across clusters. It gives information (by color) for the average expression level across cells within the cluster and the percentage (by size of the dot) of the cells express that gene within the cluster.
Seurat has a nice function for that. However, it can not do the clustering for the rows and columns. David McGaughey has written a blog post using ggplot2 and ggtree from Guangchuang Yu.
When it comes to make a heatmap, ComplexHeatmap by Zuguang Gu is my favorite. Check it out! You will be amazed on how flexible it is and the documentation is in top niche.
For Single-cell RNAseq, Seurat provides a DoHeatmap function using ggplot2. There are two limitations:
when your genes are not in the top variable gene list, the scale.data will not have that gene and DoHeatmap will drop those genes.
It is very common to see in the scRNAseq papers that the authors compare cell type abundance across groups (e.g., treatment vs control, responder vs non-responder).
Let’s create some dummy data.
library(tidyverse) set.seed(23) # we have 6 treatment samples and 6 control samples, 3 clusters A,B,C # but in the treatment samples, cluster C is absent (0 cells) in sample7 sample_id<- c(paste0("sample", 1:6, "_control", rep(c("_A","_B","_C"),each = 6)), paste0("sample", 8:12, "_treatment", rep(c("_A","_B", "_C"), each = 5))) sample_id<- c(sample_id, "sample7_treatment_A", "sample7_treatment_B") cell_id<- paste0("cell", 1:20000) cell_df<- tibble::tibble(sample_id = sample(sample_id, size = length(cell_id), replace = TRUE), cell_id = cell_id) %>% tidyr::separate(sample_id, into = c("sample_id", "group", "clusterid"), sep= "") cell_num<- cell_df %>% group_by(group, cluster_id, sample_id)%>% summarize(n=n()) cell_num ## # A tibble: 35 x 4 ## # Groups: group, cluster_id  ## group cluster_id sample_id n ## <chr> <chr> <chr> <int> ## 1 control A sample1 551 ## 2 control A sample2 546 ## 3 control A sample3 544 ## 4 control A sample4 585 ## 5 control A sample5 588 ## 6 control A sample6 542 ## 7 control B sample1 550 ## 8 control B sample2 562 ## 9 control B sample3 574 ## 10 control B sample4 563 ## # … with 25 more rows total_cells<- cell_df %>% group_by(sample_id) %>% summarise(total = n()) total_cells ## # A tibble: 12 x 2 ## sample_id total ## <chr> <int> ## 1 sample1 1673 ## 2 sample10 1713 ## 3 sample11 1691 ## 4 sample12 1696 ## 5 sample2 1633 ## 6 sample3 1700 ## 7 sample4 1711 ## 8 sample5 1768 ## 9 sample6 1727 ## 10 sample7 1225 ## 11 sample8 1720 ## 12 sample9 1743 join the two dataframe to get percentage of cells per cluster per sample
In my last post, I tried to include transgenes to the cellranger reference and want to get the counts for the transgenes. However, even after I extended the Tdtomato and Cre with the potential 3’UTR, I still get very few cells express them. This is confusing to me.
My next thought is: maybe the STAR aligner is doing something weird that excluded those reads? At this point, I want to give kb-python, a python wrapper on kallisto and bustools a try.
Using nested dataframe and list column has transformed my way of data wrangling in R. For more on this topic, I highly recommend purrr tutorial from Jenney Bryan.
In this post, I am going to show you how I use this to solve a problem for adding pct_in column from the differential scRNAseq result table.
I am going to use presto for differential gene expression test. presto performs a fast Wilcoxon rank sum test and auROC analysis.