Merge featureCount table from RNAseq

Nov 27, 2018 2 min read

featureCounts is a program to fast summarize counts from sequencing data. I use it to get gene-level RNAseq counts by

featureCounts -p -t exon -g gene_id -a annotation.gtf -o mysample_featureCount.txt mapping_results_PE.bam

If you have a lot of samples, you will get a lot of *featureCount.txt and you will need to merge them for downstream analysis.

I will show you how to merge the tables using R, python and unix below.

R solution

library(purrr)
library(tidyverse)
f_files<- list.files("results/superEnhancer/rna_expression/MSTC", pattern = "featureCount.txt", full.names = T)

read_in_feature_counts<- function(file){
        cnt<- read_tsv(file, col_names =T, comment = "#")
        cnt<- cnt %>% dplyr::select(-Chr, -Start, -End, -Strand, -Length)
        return(cnt)
}
        
raw_counts<- map(f_files, read_in_feature_counts)
raw_counts_df<- purrr::reduce(raw_counts, inner_join)

python solution

I am still very crude with python :) It works for python2.

import os
import csv
import glob

## after some google https://mail.python.org/pipermail/tutor/2004-November/033475.html
## The idea is to keep the count column into a list.

files = glob.glob("*featureCount.txt")
list_column = []
n = 1
for file in files:
    print n
    column_data = []
    with open(file, 'r') as f:
        reader = csv.reader(f, delimiter = "\t")
            # skip the comment line
        comment = next(reader)
        if n <= 1:
            for row in reader:
                # for the first file, keep the gene column as well
                column_data.append(row[0] + '\t' + row[6])
        else:
            for row in reader:
                column_data.append(row[6])
        n = n + 1
    list_column.append(column_data)


# This creates a list of row lists from the list of column lists
# If any of the column lists are too short they will be padded with None
# map function is a gem :)
rows = map(None, *list_column)

with open('output.txt','w') as f_out:
     for row in rows:
         f_out.write('\t'.join(row))
         f_out.write('\n')

unix command line solution

# get the count
ls -1  *featureCount.txt | parallel 'cat {} | sed '1d' | cut -f7 {} > {/.}_clean.txt' 
ls -1  *featureCount.txt | head -1 | xargs cut -f1 > genes.txt
paste genes.txt *featureCount_clean.txt > output.txt