understand 10x scRNAseq and scATAC fastqs

single cell RNAseq Please read the following posts by Dave Tang. When I google, I always find his posts on top of the pages. Thanks for sharing your knowledge. From the 10x manual: The final Single Cell 3’ Libraries contain the P5 and P7 primers used in Illumina bridge amplification PCR. The 10x Barcode and Read 1 (primer site for sequencing read 1) is added to the molecules during the GEMRT incubation.

How to make a transcript to gene mapping file

I need a transcript to gene mapping file for Salmon. I am aware of annotation bioconductor packages that can do this job. However, I was working on a species which does not have the annotation in a package format (I am going to use Drosphila as an example for this blog post). I had to go and got the gtf file and made such a file from scratch. Please read the specifications of those two file formats.

Understanding p value, multiple comparisons, FDR and q value

UPDATE 01/29/2019. Read this awesome paper Statistical tests, P values, confidence intervals, and power: a guide to misinterpretations. This was an old post I wrote 3 years ago after I took HarvardX: PH525.3x Advanced Statistics for the Life Sciences on edx taught by Rafael Irizarry. It is still one of the best courses to get you started using R for genomics. I am very thankful to have those high quality classes available to me when I started to learn.

PCA in action

PCA in practice. Principle Component Analysis(PCA) is a very important skill for dimention reduction to analyze high-dimentional data. High-dimentional data are data with features (p) a lot more than observations (n). This types of data are very commonly generated from high-throuput sequencing experiments. For example, an RNA-seq or microarry experiment measures expression of tens of thousands of genes for only 8 samples (4 controls and 4 treatments). Let’s use a microarray data for demonstration.

Three gotchas when using R for Genomic data analysis

During my daily work with R for genomic data analysis, I encountered several instances that R gives me some (bad) surprises. 1. The devil 1 and 0 coordinate system read detail here some files such as bed file is 0 based. Two genomic regions: chr1 0 1000 chr1 1001 2000 when you import that bed file into R using rtracklayer::import(), it will become chr1 1 1000 chr1 1002 2000 The function convert it to 1 based internally (R is 1 based unlike python).

Compute averages/sums on GRanges or equal length bins

Googling is a required technique for programmers. Once I have a programming problem in mind, the first thing I do is to google to see if other people have encountered the same problem and maybe they already have a solution. Do not re-invent the wheels. Actually, reading other people’s code and mimicing their code is a great way of learning. Today, I am going to show you how to compute binned averages/sums along a genome or any genomic regions of interest.

How to upload files to GEO

readings links: 1. create account Go to NCBI GEO: Create User ID and password. my username is research_guru I used my google account. 2. fill in the xls sheet Downloaded the meta xls sheet from

bgzip the fastqs cd 01seq find *fastq | parallel bgzip md5sum *fastq.gz > fastq_md5.txt # copy to excle cat fastq_md5.txt | awk ‘{print $2}’ #copy to excle cat fastq_md5.