During my daily work with R for genomic data analysis, I encountered several instances that R gives me some (bad) surprises.
1. The devil 1 and 0 coordinate system
some files such as
bed file is 0 based. Two genomic regions:
chr1 0 1000 chr1 1001 2000
when you import that bed file into R using
rtracklayer::import(), it will become
chr1 1 1000 chr1 1002 2000
The function convert it to 1 based internally (R is 1 based unlike python).
The problem is that when you read the bed file with
read.table and use
GenomicRanges::makeGRangesFromDataFrame() to convert it to a GRanges object, do not forget to add 1 to the start before doing it.
Similarily, when you write a GRanges object to disk using
rtracklayer::export, you do not need to worry, R will convert it back to 0 based in file.
However, if you make a dataframe out of the GRanges object, you need to remember do
start -1 before writing to a file.
read_tsv column types
If you use
readr, it will use the first 1000 rows to determine the column types (integer, charater etc). For genomic data, however, especially for the chromosome column, you may or may not have
1 0 1000 1 1000 2000 . . . X Y MT
you may fail to read rows for chromosome X, Y and MT. (To make things worse, UCSC uses chrM rather than chrMT…)
The solution is that read in all the data as characters.
library(readr) challenge2 <- read_tsv("my.bed", col_types = cols(.default = col_character()) )
3. Scientific notation for genomic coordinates
This is kind of related to 2.
1200000 will be written as
1.2e6 in a dataframe if R thinks it is an integer. So, you will need to read in the columns all as characters, or if you convert the character to numeric and wants to write to a file,
options(scipen=500) on the top of your script.
The scientific notation can not be disabled in
One more gotcha for rownames and colnames
base R will change the name with
- to a
.. e.g. TCGA-06-ABCD will be changed to TCGA.06.ABCD. this can cause troubles when you use the name of the columns to match samples.
readr will maintain the