R

Modeling single cell RNAseq data with multinomial distribution

I was reading "Feature Selection and Dimension Reduction for Single Cell RNA-Seq based on a Multinomial Model". In the paper, the authors model the scRNAseq counts using a multinomial distribution. I used the negative binomial distribution for modeling in my last post, so I asked the question on Twitter (Ming Tang, @tangming2005, November 26, 2019): "for modeling RNAseq counts, what’s the difference/advantages using negative binomial and multinomial distribution?" Some quotes from the answers I got from Matthew

negative binomial distribution in (single-cell) RNAseq

This post is inspired by two posts written by Valentine Svensson: http://www.nxn.se/valent/2017/11/16/droplet-scrna-seq-is-not-zero-inflated and http://www.nxn.se/valent/2018/1/30/count-depth-variation-makes-Poisson-scrna-seq-data-negative-binomial. The original IPython notebook can be found at https://github.com/vals/Blog/blob/master/171116-zero-inflation/Negative%20control%20analysis.ipynb. Thanks for writing those and for making both the data and the code public. After I read "Droplet scRNA-seq is not zero-inflated" by Valentine Svensson, I wanted to understand it better. This post is an effort to replicate some of the analysis in the preprint using R. The original analysis was carried out in Python.

clustering scATACseq data: the TF-IDF way

scATACseq data are very sparse, even sparser than scRNAseq data. Clustering scATACseq data requires some preprocessing steps first. I want to reproduce what has been done after reading the method sections of these two recent scATACseq papers: "A Single-Cell Atlas of In Vivo Mammalian Chromatin Accessibility", Darren Cusanovich et al., Cell 2018. From its "Latent Semantic Indexing Cluster Analysis" section: "In order to get an initial sense of the relationship between individual cells, we first broke the genome into 5kb windows and then scored each cell for any insertions in these windows, generating a large, sparse, binary matrix of 5kb windows by cells for each tissue."
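As a rough illustration of the TF-IDF step named in the title, here is a minimal base-R sketch on a toy binary windows-by-cells matrix. The toy matrix, the variable names, and the particular weighting (term frequency = each cell's column scaled to sum to 1; inverse document frequency = log(1 + number of cells / number of cells with the window open)) are assumptions chosen for illustration; this is one common formulation, not necessarily the exact one used in the paper.

```r
# Toy binary matrix: 6 genomic windows (rows) x 4 cells (columns).
# A 1 means the cell has at least one insertion in that window.
mat <- matrix(
  c(1, 0, 1, 0,
    1, 1, 0, 0,
    0, 1, 1, 1,
    1, 1, 1, 1,
    0, 0, 1, 0,
    1, 0, 0, 1),
  nrow = 6, byrow = TRUE,
  dimnames = list(paste0("win", 1:6), paste0("cell", 1:4))
)

# Term frequency: scale each cell (column) by its total number of open windows.
tf <- t(t(mat) / colSums(mat))

# Inverse document frequency: windows open in fewer cells get a larger weight.
idf <- log(1 + ncol(mat) / rowSums(mat))

# TF-IDF: idf has one value per row, so R's recycling multiplies row-wise.
tfidf <- tf * idf

# Latent semantic indexing: SVD of the TF-IDF matrix, keeping 2 components.
sv <- svd(tfidf)
cell_embedding <- sv$v[, 1:2] %*% diag(sv$d[1:2])
rownames(cell_embedding) <- colnames(mat)
```

On real data the matrix would be stored as a sparse matrix and the decomposition would be truncated, but the arithmetic is the same. The rows of `cell_embedding` are the per-cell coordinates one would then cluster.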

plot 10x scATAC coverage by cluster/group

This post was inspired by Andrew Hill’s recent blog post. As he put it on Twitter (Andrew J Hill, @ahill_tweets, April 13, 2019): "Inspired by some nice posts by @timoast and @tangming2005 and work from @10xGenomics. Would still definitely have to split BAM files for other tasks, so easy to use tools for that are super useful too!" Andrew wrote that blog post in light of my other recent blog post and Tim’s (developer of the almighty Seurat package) blog post.

Use docopt to write command line R utilities

I was writing an R script to plot the ATACseq fragment length distribution and wanted to turn it into a command line utility. I then (re)discovered this awesome docopt.R. One just needs to write the help message that one wants to display, and docopt() will parse the options and arguments and return a named list which can be accessed inside the R script. Check http://docopt.org/ for more information as well.
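A minimal sketch of that workflow, assuming the docopt package is installed. The script name fragment_length.R, the --bins option, and the file names are hypothetical and only serve the illustration; in a real script you would drop the `args` argument so that docopt() reads the actual command line.

```r
library(docopt)

# The help message doubles as the interface specification.
doc <- "Plot the ATACseq fragment length distribution.

Usage:
  fragment_length.R [--bins=<n>] <input> <output>
  fragment_length.R -h | --help

Options:
  -h --help    Show this help message.
  --bins=<n>   Number of histogram bins [default: 50].
"

# Simulated command line arguments (hypothetical file names).
args <- c("--bins=100", "fragments.txt", "fragments.pdf")

# docopt() returns a named list; values arrive as character strings.
opts <- docopt(doc, args = args)

input_file  <- opts$input
output_file <- opts$output
n_bins      <- as.integer(opts$bins)
```

Because the parsed values are strings, numeric options such as the bin count need an explicit conversion before use.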

Evaluating single cell RNAseq cluster stability

An R package for evaluating and visualizing scRNAseq cluster stability

Three gotchas when using R for Genomic data analysis

During my daily work with R for genomic data analysis, I encountered several cases in which R gave me some (bad) surprises. 1. The devil of the 1-based and 0-based coordinate systems. Read the details here: https://github.com/crazyhottommy/DNA-seq-analysis#tips-and-lessons-learned-during-my-dna-seq-data-analysis-journey. Some file formats, such as BED, are 0-based. Take two genomic regions:
chr1 0 1000
chr1 1001 2000
When you import that BED file into R using rtracklayer::import(), they become:
chr1 1 1000
chr1 1002 2000
The function converts the start coordinates to 1-based internally (R is 1-based, unlike Python).
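The conversion can be mimicked in base R without rtracklayer. This is only a sketch of the arithmetic (the data frame and column names are made up for illustration): a BED start is 0-based while the end is already the last base in 1-based terms, so only the start shifts by one.

```r
# The two regions from the BED file (0-based start, end not included).
bed <- data.frame(
  chrom = c("chr1", "chr1"),
  start = c(0L, 1001L),
  end   = c(1000L, 2000L)
)

# Convert to 1-based, closed intervals: add 1 to the start, keep the end.
one_based <- transform(bed, start = start + 1L)

# The region widths agree in both systems.
bed_width       <- bed$end - bed$start
one_based_width <- one_based$end - one_based$start + 1L

print(one_based)
```

The printed starts are 1 and 1002, matching what the import produces for these two regions.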