Bioinformatics

Reviving BETA for Python 3: Integrating ChIP-seq and RNA-seq to Predict TF Targets

I started to learn bioinformatics because I needed to analyze public ChIP-seq data in 2012. That’s how I got to know Shirley Liu’s lab at Dana-Farber Cancer Institute. Little did I know that I would join her group in 2020 as a staff scientist to lead the CIDC bioinformatics project. I witnessed the development of many groundbreaking computational tools for genomics in Shirley’s lab. One tool that I found particularly elegant was BETA (Binding and Expression Target Analysis), developed by Su Wang and published in Nature Protocols in 2013.

Understanding prcomp() center and scale Arguments for Single-Cell RNA-seq PCA

During my work with single-cell RNA-seq data, I’ve often encountered confusion about PCA and specifically when to use the center and scale arguments in R’s prcomp() function. While tools like Seurat’s RunPCA() abstract away these details, understanding what happens under the hood is crucial for proper analysis and troubleshooting. In this post, I’ll show you exactly what center and scale do, why they matter, and what happens when you get them wrong.
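As a rough illustration of what the two arguments control, here is a minimal NumPy sketch (Python rather than R, and not prcomp() itself). It assumes a samples-by-features matrix. Centering subtracts each column's mean, and scaling divides each column by its standard deviation (or by its root mean square when the data are not centered) before the SVD. The helper name pca_scores and the toy matrix are made up for this example.

```python
import numpy as np

def pca_scores(x, center=True, scale=False):
    """PCA via SVD on a samples-by-features matrix (hypothetical helper)."""
    x = np.asarray(x, dtype=float)
    n = x.shape[0]
    if center:
        # Subtract each column's mean so every feature is centered at zero.
        x = x - x.mean(axis=0)
    if scale:
        # Divide each column by sqrt(sum(x^2) / (n - 1)); after centering
        # this is the sample standard deviation of the column.
        x = x / np.sqrt((x ** 2).sum(axis=0) / (n - 1))
    u, s, vt = np.linalg.svd(x, full_matrices=False)
    scores = u * s              # coordinates of each sample on the PCs
    sdev = s / np.sqrt(n - 1)   # standard deviation of each PC
    return scores, sdev

# Toy data: three features on very different scales, none centered at zero.
rng = np.random.default_rng(0)
toy = rng.normal(size=(50, 3)) * np.array([1.0, 10.0, 100.0]) + 5.0

_, sdev_centered = pca_scores(toy, center=True, scale=False)
_, sdev_scaled = pca_scores(toy, center=True, scale=True)

# Fraction of total variance captured by PC1 in each case.
print(sdev_centered[0] ** 2 / (sdev_centered ** 2).sum())
print(sdev_scaled[0] ** 2 / (sdev_scaled ** 2).sum())
```

Without scaling, the feature with the largest variance dominates the first component. With scaling, every feature contributes one unit of variance, so the PC variances sum to the number of features.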

I regret not doing so

To not miss a post like this, sign up for my newsletter to learn computational biology and bioinformatics. My regret is not learning linear algebra well during college. I barely passed the exam for it (and for calculus, which was a nightmare :) ). To be fair, it was not taught well and it sounded too boring. I did not know what the application of matrix multiplication was, not until… Many years later, I started to learn bioinformatics.

How CCA alignment and cell label transfer work in Seurat

Understand CCA Following my last blog post on PCA projection and cell label transfer, we are going to talk about CCA. In single-cell RNA-seq data integration using Canonical Correlation Analysis (CCA), we typically align two matrices representing different datasets, where both datasets have the same set of genes but different numbers of cells.

How PCA projection and cell label transfer work in Seurat

Understand the example datasets We will use PBMC3k and PBMC10k data. We will project the PBMC3k data to the PBMC10k data and get the labels
library(Seurat)
library(Matrix)
library(irlba) # For PCA
library(RcppAnnoy) # For fast nearest neighbor search
library(dplyr)
# Assuming the PBMC datasets (3k and 10k) are already normalized
# and represented as sparse matrices
# devtools::install_github('satijalab/seurat-data')
library(SeuratData)
#AvailableData()
#InstallData("pbmc3k")
pbmc3k<-UpdateSeuratObject(pbmc3k)
pbmc3k@meta.

You need to master it if you deal with genomics data

Motivation What’s the most common problem you need to solve when dealing with genomics data? For me, it is Genomic Intervals! Genomics data are usually represented linearly: chromosome name, start and end. We use these to define a region in the genome (a peak from ChIP-seq data), the location of a gene, a DNA methylation site (a single point), a mutation call (a single point), a duplicated region in cancer, etc.
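To make the chromosome/start/end representation concrete, here is a small Python sketch of an interval type and an overlap check. It assumes BED-style coordinates (0-based start, exclusive end). The names Interval and overlaps and the example coordinates are invented for illustration.

```python
from typing import NamedTuple

class Interval(NamedTuple):
    chrom: str
    start: int  # 0-based, inclusive (BED-style assumption)
    end: int    # exclusive

def overlaps(a: Interval, b: Interval) -> bool:
    """True if two half-open intervals on the same chromosome share a base."""
    return a.chrom == b.chrom and a.start < b.end and b.start < a.end

peak = Interval("chr1", 100, 200)  # e.g. a ChIP-seq peak
gene = Interval("chr1", 150, 400)  # e.g. a gene body
snp = Interval("chr1", 200, 201)   # a single-base site just past the peak

print(overlaps(peak, gene))  # the peak and the gene share bases 150..199
print(overlaps(peak, snp))   # the peak ends before base 200, so no overlap
```

With half-open coordinates, a single point is an interval of length one, and two intervals that merely touch (one ends where the other starts) do not overlap.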

A docker image to keep this site alive

I have been writing blog posts for over 10 years. I was using Blogspot, and in 2018 I switched to blogdown and I love it. My blogdown website divingintogeneticsandgenomics.com was using Hugo v0.42 and blogdown v1.0. It has been many years, and now I have a MacBook Pro with an M3 chip. I could not install the old versions of the R packages to serve the site.

The Most Common Mistake In Bioinformatics, one-off error

In my last blog post, I talked about some common bioinformatics mistakes. Today, we are going to talk about THE MOST common bioinformatics mistake people make, and I think it deserves its own post. Even some experienced programmers get it wrong, and the mistake is prevalent in many bioinformatics tools: the off-by-one mistake!

The Most Common Stupid Mistakes In Bioinformatics

This post is inspired by this popular thread in https://www.biostars.org/. Common mistakes in general Off-by-One Errors: Mistakes occur when switching between different indexing systems. For example, BED files are 0-based while GFF/GTF files are 1-based, leading to potential misinterpretations of genomic coordinates. This is one of the most common mistakes!
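The BED versus GFF/GTF difference can be sketched in a few lines of Python. This assumes the usual conventions: BED uses a 0-based start with an exclusive end, while GFF/GTF uses a 1-based start with an inclusive end. The function names are hypothetical.

```python
def bed_to_gff(start: int, end: int) -> tuple[int, int]:
    """Convert 0-based half-open (BED) to 1-based closed (GFF/GTF)."""
    return start + 1, end

def gff_to_bed(start: int, end: int) -> tuple[int, int]:
    """Convert 1-based closed (GFF/GTF) to 0-based half-open (BED)."""
    return start - 1, end

# The first 100 bases of a chromosome in both systems.
bed_start, bed_end = 0, 100
gff_start, gff_end = bed_to_gff(bed_start, bed_end)

# The region length is the same, but the formula differs by one.
bed_length = bed_end - bed_start      # half-open: end - start
gff_length = gff_end - gff_start + 1  # closed: end - start + 1
print(gff_start, gff_end, bed_length, gff_length)
```

Only the start coordinate shifts during conversion. Forgetting that shift, or using the wrong length formula, is how a region ends up one base too long or too short.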

Six tips to build a strong Bioinformatics CV

If you apply for a Bioinformatics position, hundreds of CVs get sent to the hiring manager. How do you stand out among all of them? Below are 6 tips from my hiring experience: Include a GitHub Link: Ensure your CV has a GitHub link with relevant content like Python or R packages, data analysis projects, or replicated figures from published papers.