Genomics

How to make a triangle correlation heatmap with p-values labeled

In this blog post, I am going to show you how to make a correlation heatmap with p-values and significant values labeled in the heatmap body. Let’s use the PBMC single cell data as an example. You may want to read my previous blog post How to do gene correlation for single-cell RNAseq data. Load libraries library(dplyr) library(Seurat) library(patchwork) library(ggplot2) library(ComplexHeatmap) library(SeuratData) library(hdWGCNA) library(WGCNA) set.seed(1234) prepare the data data("pbmc3k") pbmc3k #> An object of class Seurat #> 13714 features across 2700 samples within 1 assay #> Active assay: RNA (13714 features, 0 variable features) ## routine processing pbmc3k<- pbmc3k %>% NormalizeData(normalization.

Run Rstudio server with singularity on HPC

Background Please read the following before go ahead: what is docker? what is Rocker? what is singularity? from Harvard Research computing website: Odyssey has singularity installed. Why Singularity? There are some important differences between Docker and Singularity: Docker and Singularity have their own container formats. Docker containers may be imported to run via Singularity. Docker containers need root privileges for full functionality which is not suitable for a shared HPC environment.

Calculate scATACseq TSS enrichment score

TSS enrichment score serves as an important quality control metric for ATACseq data. I want to write a script for single cell ATACseq data. From the Encode page: Transcription Start Site (TSS) Enrichment Score - The TSS enrichment calculation is a signal to noise calculation. The reads around a reference set of TSSs are collected to form an aggregate distribution of reads centered on the TSSs and extending to 1000 bp in either direction (for a total of 2000bp).

understand 10x scRNAseq and scATAC fastqs

single cell RNAseq Please read the following posts by Dave Tang. When I google, I always find his posts on top of the pages. Thanks for sharing your knowledge. https://davetang.org/muse/2018/06/06/10x-single-cell-bam-files/ https://davetang.org/muse/2018/08/09/getting-started-with-cell-ranger/ From the 10x manual: The final Single Cell 3’ Libraries contain the P5 and P7 primers used in Illumina bridge amplification PCR. The 10x Barcode and Read 1 (primer site for sequencing read 1) is added to the molecules during the GEMRT incubation.

How to make a transcript to gene mapping file

I need a transcript to gene mapping file for Salmon. I am aware of annotation bioconductor packages that can do this job. However, I was working on a species which does not have the annotation in a package format (I am going to use Drosphila as an example for this blog post). I had to go and got the gtf file and made such a file from scratch. Please read the specifications of those two file formats.

Understanding p value, multiple comparisons, FDR and q value

UPDATE 01/29/2019. Read this awesome paper Statistical tests, P values, confidence intervals, and power: a guide to misinterpretations. This was an old post I wrote 3 years ago after I took HarvardX: PH525.3x Advanced Statistics for the Life Sciences on edx taught by Rafael Irizarry. It is still one of the best courses to get you started using R for genomics. I am very thankful to have those high quality classes available to me when I started to learn.

PCA in action

PCA in practice. Principal Component Analysis(PCA) is a very important skill for dimention reduction to analyze high-dimentional data. High-dimentional data are data with features (p) a lot more than observations (n). This types of data are very commonly generated from high-throuput sequencing experiments. For example, an RNA-seq or microarry experiment measures expression of tens of thousands of genes for only 8 samples (4 controls and 4 treatments). Let’s use a microarray data for demonstration.

Three gotchas when using R for Genomic data analysis

During my daily work with R for genomic data analysis, I encountered several instances that R gives me some (bad) surprises. 1. The devil 1 and 0 coordinate system read detail here https://github.com/crazyhottommy/DNA-seq-analysis#tips-and-lessons-learned-during-my-dna-seq-data-analysis-journey some files such as bed file is 0 based. Two genomic regions: chr1 0 1000 chr1 1001 2000 when you import that bed file into R using rtracklayer::import(), it will become chr1 1 1000 chr1 1002 2000 The function convert it to 1 based internally (R is 1 based unlike python).

Compute averages/sums on GRanges or equal length bins

Googling is a required technique for programmers. Once I have a programming problem in mind, the first thing I do is to google to see if other people have encountered the same problem and maybe they already have a solution. Do not re-invent the wheels. Actually, reading other people’s code and mimicing their code is a great way of learning. Today, I am going to show you how to compute binned averages/sums along a genome or any genomic regions of interest.

How to upload files to GEO

readings links: http://yeolab.github.io/onboarding/geo.html http://www.hildeschjerven.net/Protocols/Submission_of_HighSeq_data_to_GEO.pdf https://www.ncbi.nlm.nih.gov/geo/info/submissionftp.html 1. create account Go to NCBI GEO: http://www.ncbi.nlm.nih.gov/geo/ Create User ID and password. my username is research_guru I used my google account. 2. fill in the xls sheet Downloaded the meta xls sheet from https://www.ncbi.nlm.nih.gov/geo/info/seq.html

bgzip the fastqs cd 01seq find *fastq | parallel bgzip md5sum *fastq.gz > fastq_md5.txt # copy to excle cat fastq_md5.txt | awk ‘{print $2}’ #copy to excle cat fastq_md5.