R

stacked violin plot for visualizing single-cell data in Seurat

In scanpy, there is a function to create a stacked violin plot. There is no such function in Seurat, and many people were asking for this feature. e.g. https://github.com/satijalab/seurat/issues/300 https://github.com/satijalab/seurat/issues/463 The developers have not implemented this feature yet. In this post, I am trying to make a stacked violin plot in Seurat. The idea is to create a violin plot per gene using the VlnPlot in Seurat, then customize the axis text/tick and reduce the margin for each plot and finally concatenate by cowplot::plot_grid or patchwork::wrap_plots.

compare kallisto-bustools and cellranger for single nuclei sequencing data

In my last post, I tried to include transgenes to the cellranger reference and want to get the counts for the transgenes. However, even after I extended the Tdtomato and Cre with the potential 3’UTR, I still get very few cells express them. This is confusing to me. My next thought is: maybe the STAR aligner is doing something weird that excluded those reads? At this point, I want to give kb-python, a python wrapper on kallisto and bustools a try.

My opinionated selection of books/urls for bioinformatics/data science curriculum

There was a paper on this topic: A New Online Computational Biology Curriculum. I am going to provide a biased list below (I have read most of the books if not all). I say it is biased because you will see many books of R are from Hadely Wickham. I now use tidyverse most of the time. Unix I suggest people who want to learn bioinformatics starting to learn unix commands first.

Develop Bioconductor packages with docker container

Readings links to read: https://www.bioconductor.org/developers/package-guidelines/#rcode https://github.com/Bioconductor/Contributions use container https://github.com/Bioconductor/bioconductor_full I am following the last link. pull the container docker pull bioconductor/bioconductor_full:devel docker images REPOSITORY TAG IMAGE ID CREATED SIZE bioconductor/bioconductor_full devel ae3ec2be7376 3 hours ago 5.7GB seuratv3 latest 9b358ab1fd63 2 days ago 2.76GB It is 5.7G in size. start the Rstuido from the image. I have another Rstudio instance using port 8787, let me use a different one (e.

clustering scATACseq data: the TF-IDF way

scATACseq data are very sparse. It is sparser than scRNAseq. To do clustering of scATACseq data, there are some preprocessing steps need to be done. I want to reproduce what has been done after reading the method section of these two recent scATACseq paper: A Single-Cell Atlas of In Vivo Mammalian Chromatin Accessibility Darren et.al Cell 2018 Latent Semantic Indexing Cluster Analysis In order to get an initial sense of the relationship between individual cells, we first broke the genome into 5kb windows and then scored each cell for any insertions in these windows, generating a large, sparse, binary matrix of 5kb windows by cells for each tissue.

plot 10x scATAC coverage by cluster/group

This post was inspired by Andrew Hill’s recent blog post. Inspired by some nice posts by @timoast and @tangming2005 and work from @10xGenomics. Would still definitely have to split BAM files for other tasks, so easy to use tools for that are super useful too! — Andrew J Hill (@ahill_tweets) April 13, 2019 Andrew wrote that blog post in light of my other recent blog post and Tim’s (developer of the almighty Seurat package) blog post.

Use docopt to write command line R utilities

I was writing an R script to plot the ATACseq fragment length distribution and wanted to turn the R script to a command line utility. I then (re)discovered this awesome docopt.R. One just needs to write the help message the you want to display and docopt() will parse the options, arguments and return a named list which can be accessed inside the R script. check http://docopt.org/ for more information as well.

Understanding p value, multiple comparisons, FDR and q value

UPDATE 01/29/2019. Read this awesome paper Statistical tests, P values, confidence intervals, and power: a guide to misinterpretations. This was an old post I wrote 3 years ago after I took HarvardX: PH525.3x Advanced Statistics for the Life Sciences on edx taught by Rafael Irizarry. It is still one of the best courses to get you started using R for genomics. I am very thankful to have those high quality classes available to me when I started to learn.

permutation test for PCA components

PCA is a critical method for dimension reduction for high-dimensional data. High-dimensional data are data with features (p) a lot more than observations (n). However, this is changing with single-cell RNAseq data. Now, we can sequence millions (n) of single cells and each cell has ~20,000 genes/features (p). I suggest you read my previous blog post on using svd to calculate PCs. Single-cell expression data PCA In single-cell RNAseq analysis, feature selection will be performed first.

The end of 2018

It is almost the end of 2018. It is a good time to review what I have achieved during the year and look forward to a brand new 2019. I wrote a similar post for 2017 here. Some highlights of the year 2018: My son Noah Tang was born in April. He is so lovely and we love him so much. Can’t believe he is almost 9 months old.