R

Use docopt to write command line R utilities

I was writing an R script to plot the ATACseq fragment length distribution and wanted to turn it into a command line utility. I then (re)discovered the awesome docopt.R. You just write the help message that you want to display, and docopt() parses the options and arguments and returns a named list that can be accessed inside the R script. Check http://docopt.org/ for more information as well.
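As a minimal sketch of the idea (not the script from the post), the example below parses a help message with docopt(). The script name plot_fragments.R, the --bins option, and the file names are hypothetical and only for illustration. The arguments are passed explicitly through `args` so the snippet can be run interactively; in a real script you would leave `args` at its default so that the actual command line is used.

```r
library(docopt)

# The help message doubles as the interface specification.
# Script name, option, and file names are made up for this sketch.
doc <- "Usage:
  plot_fragments.R [--bins=<n>] <input> <output>

Options:
  --bins=<n>  Number of histogram bins [default: 50]
  -h --help   Show this help message
"

# Pass arguments explicitly for demonstration purposes
opts <- docopt(doc, args = c("--bins=100", "fragments.txt", "fragments.pdf"))

# docopt() returns a named list; values arrive as character strings
input_file <- opts[["<input>"]]
n_bins <- as.integer(opts[["--bins"]])
```

Because every value comes back as a string, numeric options such as the bin count need an explicit conversion before use.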

Understanding p value, multiple comparisons, FDR and q value

UPDATE 01/29/2019. Read this awesome paper: Statistical tests, P values, confidence intervals, and power: a guide to misinterpretations. This is an old post I wrote 3 years ago after taking HarvardX: PH525.3x Advanced Statistics for the Life Sciences on edX, taught by Rafael Irizarry. It is still one of the best courses to get you started using R for genomics. I am very thankful to have had those high-quality classes available when I started to learn.

permutation test for PCA components

PCA is a critical method for dimension reduction of high-dimensional data. High-dimensional data are data with many more features (p) than observations (n). However, this is changing with single-cell RNAseq data. Now, we can sequence millions (n) of single cells, and each cell has ~20,000 genes/features (p). I suggest you read my previous blog post on using svd to calculate PCs.
Single-cell expression data PCA
In single-cell RNAseq analysis, feature selection will be performed first.
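For readers who have not seen the svd route to PCs, here is a minimal base R sketch on simulated data (the matrix and its dimensions are assumptions for illustration, not data from the post). It centers the matrix, takes the SVD, and checks the result against prcomp().

```r
set.seed(1)

# Simulated data: 20 observations (rows) x 5 features (columns)
X <- matrix(rnorm(20 * 5), nrow = 20, ncol = 5)

# Center each feature, then decompose
Xc <- scale(X, center = TRUE, scale = FALSE)
s <- svd(Xc)

# PC scores are U %*% D; the loadings are the columns of V
scores <- s$u %*% diag(s$d)

# Proportion of variance explained by each component
var_explained <- s$d^2 / sum(s$d^2)

# Compare with prcomp(); the sign of each component is arbitrary,
# so absolute values are compared
pca <- prcomp(X, center = TRUE, scale. = FALSE)
same_scores <- isTRUE(all.equal(abs(unname(pca$x)), abs(scores)))
```

The variance proportions computed from the singular values match `pca$sdev^2 / sum(pca$sdev^2)`.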

The end of 2018

It is almost the end of 2018. It is a good time to review what I have achieved during the year and look forward to a brand new 2019. I wrote a similar post for 2017 here. Some highlights of the year 2018: My son Noah Tang was born in April. He is so lovely and we love him so much. Can’t believe he is almost 9 months old.

Three gotchas when using R for Genomic data analysis

During my daily work with R for genomic data analysis, I have encountered several instances where R gave me some (bad) surprises.
1. The devil of the 1-based and 0-based coordinate systems. Read the details here: https://github.com/crazyhottommy/DNA-seq-analysis#tips-and-lessons-learned-during-my-dna-seq-data-analysis-journey
Some files, such as BED files, are 0-based. Take two genomic regions:
chr1 0 1000
chr1 1001 2000
When you import that BED file into R using rtracklayer::import(), it becomes
chr1 1 1000
chr1 1002 2000
The function converts the coordinates to 1-based internally (R is 1-based, unlike Python).
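The coordinate shift in the BED example above can be reproduced by hand in base R. This is only a sketch of the arithmetic (the start is shifted by one, the end is unchanged), not a description of how rtracklayer implements it; the data frame and its column names are assumptions for illustration.

```r
# The two 0-based BED intervals from the example above
bed <- data.frame(
  chrom = c("chr1", "chr1"),
  start = c(0L, 1001L),
  end   = c(1000L, 2000L)
)

# 0-based, half-open  ->  1-based, closed:
# add 1 to the start and leave the end as it is
one_based <- transform(bed, start = start + 1L)

print(one_based)
```

Going the other way, for example when writing 1-based ranges back to a BED file by hand, means subtracting 1 from the start again.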

Compute averages/sums on GRanges or equal length bins

Googling is a required skill for programmers. Once I have a programming problem in mind, the first thing I do is google it to see whether other people have encountered the same problem and maybe already have a solution. Do not reinvent the wheel. Actually, reading other people’s code and mimicking it is a great way of learning. Today, I am going to show you how to compute binned averages/sums along a genome or any genomic regions of interest.

my first try on Rmarkdown using blogdown

I have used blogdown to write regular markdown posts, but the real power comes from Rmarkdown! Let me try it for this post. Note that you do not knit the Rmarkdown yourself; rather, you let blogdown do the heavy lifting. library(tidyverse) ## Loading tidyverse: ggplot2 ## Loading tidyverse: tibble ## Loading tidyverse: tidyr ## Loading tidyverse: readr ## Loading tidyverse: purrr ## Loading tidyverse: dplyr ## Warning: package ‘tibble’ was built under R version 3.

hugo academic theme blog down deployment (some details)

I have been following this tutorial from Alison and tips from Leslie Myint for some customization when deploying my blogdown website. It is quite straightforward to get a working site by following Alison’s guide. However, you always want some customization of your own site. I took the tips from Leslie. I changed the menu bar to black; I like it better than the default white. In the config.toml file, change the theme: