My odyssey of obtaining scRNAseq metadata

Jun 8, 2022 11 min read bioinformatics, scRNAseq, R

I want to curate a public scRNAseq dataset from this paper Single-cell analyses reveal key immune cell subsets associated with response to PD-L1 blockade in triple-negative breast cancer

ffq

I first tried ffq, but it gave me errors.

ffq fetches metadata information from the following databases:

GEO: Gene Expression Omnibus,
SRA: Sequence Read Archive,
EMBL-EBI: European Molecular BIology Laboratory’s European BIoinformatics Institute,
DDBJ: DNA Data Bank of Japan,
NIH Biosample: Biological source materials used in experimental assays,
ENCODE: The Encyclopedia of DNA Elements.

(base) ➜  ~ ffq GSE169246
[2022-06-06 14:32:22,716]    INFO Parsing GEO GSE169246
[2022-06-06 14:32:24,418]    INFO Finding supplementary files for GEO GSE169246
[2022-06-06 14:32:28,001] WARNING There are 83 samples for GSE169246
[2022-06-06 14:32:28,001]    INFO Parsing GSM GSM5188367
[2022-06-06 14:32:33,047]    INFO Finding supplementary files for GSM GSM5188367
[2022-06-06 14:32:34,409]    INFO No supplementary files found for GSM5188367
[2022-06-06 14:32:39,659]    INFO Getting sample for GSM5188367
[2022-06-06 14:32:41,847] WARNING No sample found. Either the provided GSM accession is invalid or raw data was not provided for this record

get the metadata using GEOquery

other methods: pysradb https://github.com/saketkc/pysradb

I then tried GEOquery

library(GEOquery)
library(tidyverse)
library(here)
#getGEOSuppFiles(GEO ="GSE176021")
# one can also get the matrix by setting GSEMatrix to TRUE
meta<- getGEO(GEO= "GSE169246",GSEMatrix=FALSE)

The meta object’s gsms slot contains useful information

## metadata 
meta@gsms$GSM5188440@header$characteristics_ch1

#> [1] "tissue: blood"               "disease state: TNBC patient"
#> [3] "group: Progression"          "treatment: Chemo"

## sample ID, I need it because the count matrix cell name is prefixed by it
meta@gsms$GSM5188440@header$title

#> [1] "Prog_P008_b"

## patient ID
meta@gsms$GSM5188440@header$source_name_ch1

#> [1] "P008"

meta@gsms is a list of lists. purrr::map() is a perfect tool to tidy it.

you can learn more purrr from https://jennybc.github.io/purrr-tutorial/

# each list entry is one sample
names(meta@gsms)

#>  [1] "GSM5188367" "GSM5188368" "GSM5188369" "GSM5188370" "GSM5188371"
#>  [6] "GSM5188372" "GSM5188373" "GSM5188374" "GSM5188375" "GSM5188376"
#> [11] "GSM5188377" "GSM5188378" "GSM5188379" "GSM5188380" "GSM5188381"
#> [16] "GSM5188382" "GSM5188383" "GSM5188384" "GSM5188385" "GSM5188386"
#> [21] "GSM5188387" "GSM5188388" "GSM5188389" "GSM5188390" "GSM5188391"
#> [26] "GSM5188392" "GSM5188393" "GSM5188394" "GSM5188395" "GSM5188396"
#> [31] "GSM5188397" "GSM5188398" "GSM5188399" "GSM5188400" "GSM5188401"
#> [36] "GSM5188402" "GSM5188403" "GSM5188404" "GSM5188405" "GSM5188406"
#> [41] "GSM5188407" "GSM5188408" "GSM5188409" "GSM5188410" "GSM5188411"
#> [46] "GSM5188412" "GSM5188413" "GSM5188414" "GSM5188415" "GSM5188416"
#> [51] "GSM5188417" "GSM5188418" "GSM5188419" "GSM5188420" "GSM5188421"
#> [56] "GSM5188422" "GSM5188423" "GSM5188424" "GSM5188425" "GSM5188426"
#> [61] "GSM5188427" "GSM5188428" "GSM5188429" "GSM5188430" "GSM5188431"
#> [66] "GSM5188432" "GSM5188433" "GSM5188434" "GSM5188435" "GSM5188436"
#> [71] "GSM5188437" "GSM5188438" "GSM5188439" "GSM5188440" "GSM5188441"
#> [76] "GSM5188442" "GSM5188443" "GSM5188444" "GSM5188445" "GSM5585280"
#> [81] "GSM5585281" "GSM5585282" "GSM5585283"

meta@gsms$GSM5188367@header

#> $channel_count
#> [1] "1"
#> 
#> $characteristics_ch1
#> [1] "tissue: blood"               "disease state: TNBC patient"
#> [3] "group: Pre-treatment"        "treatment: anti-PDL1+Chemo" 
#> 
#> $contact_address
#> [1] "Yiheyuan Road"
#> 
#> $contact_city
#> [1] "Beijing"
#> 
#> $contact_country
#> [1] "China"
#> 
#> $contact_department
#> [1] "BIOPIC"
#> 
#> $contact_email
#> [1] "zhangyuanyuanbio@pku.edu.cn"
#> 
#> $contact_institute
#> [1] "Peking Univerisity"
#> 
#> $contact_laboratory
#> [1] "Zemin Zhang's Lab"
#> 
#> $contact_name
#> [1] "Yuanyuan,,Zhang"
#> 
#> $contact_state
#> [1] "Beijing"
#> 
#> $`contact_zip/postal_code`
#> [1] "100871"
#> 
#> $data_processing
#> [1] "For 10X data, Cell Ranger 3.0.0 was used to quantify gene expression level, identify TCR sequences and quantify ATAC-seq peaks ."
#> [2] "Genome_build: GRCh38 for RNA expression,TCR data and ATAC peaks."                                                                
#> 
#> $data_row_count
#> [1] "0"
#> 
#> $description
#> [1] "10X Genomics platform"
#> 
#> $extract_protocol_ch1
#> [1] "Peripheral blood mononuclear cells (PBMCs) were isolated with HISTOPAQUE-1077 (Sigma-Aldrich) solution. In brief, 3 ml fresh peripheral blood was collected at baseline, 4 weeks after treatment initiation and disease progression in EDTA anticoagulant tubes and subsequently layered onto HISTOPAQUE-1077. Tumor biopsy samples were first stored in RNAlater RNA stabilization reagent (QIAGEN) after collection and kept on ice to avoid RNA degradation. Samples were then collected in the RPMI-1640 medium (Gibco) with 10% FBS (Gibco), and enzymatically digested with gentle MACS Tumor Dissociation Kit (Miltenyi Biotec) for 60 min on a rotor at 37 °C according to the manufacturer’s protocol. The dissociated cells were next passed through a 40-µm cell-strainer (BD) in the RPMI-1640 medium (Invitrogen) with 10% FBS until uniform cell suspensions were obtained. Subsequently, the suspended cells were passed through cell strainers and centrifuged at 400 g for 10 min. Red blood cells were removed via the same procedure described above. After washing twice with 1x PBS (Invitrogen), the cell pellets were re-suspended in sorting buffer (PBS supplemented with 1% FBS)."
#> [2] "single cells were sorted into 1.5 ml tubes (Eppendorf) and counted manually under the microscope. The concentration of single cell suspensions was adjusted to 500-1200 cells/ul. Cells were loaded between 7,000 and 15,000 cells/chip position using the 10x Chromium Single cell 5’ Library, Gel Bead & Multiplex Kit and Chip Kit (10x Genomics, V3 barcoding chemistry) according to the manufacturer’s instructions. All the subsequent steps were performed following the standard manufacturer’s protocols. Purified libraries were analyzed by an Illumina Hiseq X Ten sequencer with 150-bp paired-end reads."                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    
#> [3] "10x Chromium Single cell 5’ + VDJ Library, and 10x Chromium Single cell ATAC-seq Library"                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                   
#> 
#> $geo_accession
#> [1] "GSM5188367"
#> 
#> $instrument_model
#> [1] "HiSeq X Ten"
#> 
#> $last_update_date
#> [1] "Sep 17 2021"
#> 
#> $library_selection
#> [1] "cDNA"
#> 
#> $library_source
#> [1] "transcriptomic"
#> 
#> $library_strategy
#> [1] "RNA-Seq"
#> 
#> $molecule_ch1
#> [1] "polyA RNA"
#> 
#> $organism_ch1
#> [1] "Homo sapiens"
#> 
#> $platform_id
#> [1] "GPL20795"
#> 
#> $relation
#> [1] "BioSample: https://www.ncbi.nlm.nih.gov/biosample/SAMN18383051"
#> 
#> $series_id
#> [1] "GSE169246"
#> 
#> $source_name_ch1
#> [1] "P007"
#> 
#> $status
#> [1] "Public on Sep 15 2021"
#> 
#> $submission_date
#> [1] "Mar 19 2021"
#> 
#> $supplementary_file_1
#> [1] "NONE"
#> 
#> $taxid_ch1
#> [1] "9606"
#> 
#> $title
#> [1] "Pre_P007_b"
#> 
#> $treatment_protocol_ch1
#> [1] "Half of 22 TNBC patients were treated with paclitaxel monotherapy and the other half were treated with paclitaxel plus atezolizumab as first-line treatment"
#> 
#> $type
#> [1] "SRA"

# extract the metadata from the header
purrr::map(meta@gsms, ~.x@header$characteristics_ch1)[1:5]

#> $GSM5188367
#> [1] "tissue: blood"               "disease state: TNBC patient"
#> [3] "group: Pre-treatment"        "treatment: anti-PDL1+Chemo" 
#> 
#> $GSM5188368
#> [1] "tissue: lymph_node"          "disease state: TNBC patient"
#> [3] "group: Pre-treatment"        "treatment: anti-PDL1+Chemo" 
#> 
#> $GSM5188369
#> [1] "tissue: blood"               "disease state: TNBC patient"
#> [3] "group: Pre-treatment"        "treatment: anti-PDL1+Chemo" 
#> 
#> $GSM5188370
#> [1] "tissue: lymph_node"          "disease state: TNBC patient"
#> [3] "group: Pre-treatment"        "treatment: anti-PDL1+Chemo" 
#> 
#> $GSM5188371
#> [1] "tissue: blood"               "disease state: TNBC patient"
#> [3] "group: Pre-treatment"        "treatment: anti-PDL1+Chemo"

Also, have you used the awesome stack and unstack function in base R? There are so many good hidden functions.

# save it to a variable
meta1<- purrr::map(meta@gsms, ~.x@header$characteristics_ch1) %>%
        stack() %>%
        separate(values, into = c("condition", "value"), sep= ": ")%>%
        pivot_wider(names_from= condition, values_from = value) %>%
        janitor::clean_names()
        

meta2<- purrr::map(meta@gsms, ~.x@header$title) %>%
        stack() %>%
        dplyr::rename(sample_id = values)


meta3<- purrr::map(meta@gsms, ~.x@header$source_name_ch1) %>%
        stack() %>%
        dplyr::rename(patient_id = values)



head(meta1)

#> # A tibble: 6 × 5
#>   ind        tissue     disease_state group         treatment      
#>   <fct>      <chr>      <chr>         <chr>         <chr>          
#> 1 GSM5188367 blood      TNBC patient  Pre-treatment anti-PDL1+Chemo
#> 2 GSM5188368 lymph_node TNBC patient  Pre-treatment anti-PDL1+Chemo
#> 3 GSM5188369 blood      TNBC patient  Pre-treatment anti-PDL1+Chemo
#> 4 GSM5188370 lymph_node TNBC patient  Pre-treatment anti-PDL1+Chemo
#> 5 GSM5188371 blood      TNBC patient  Pre-treatment anti-PDL1+Chemo
#> 6 GSM5188372 blood      TNBC patient  Pre-treatment Chemo

head(meta2)

#>    sample_id        ind
#> 1 Pre_P007_b GSM5188367
#> 2 Pre_P007_t GSM5188368
#> 3 Pre_P012_b GSM5188369
#> 4 Pre_P012_t GSM5188370
#> 5 Pre_P014_b GSM5188371
#> 6 Pre_P023_b GSM5188372

head(meta3)

#>   patient_id        ind
#> 1       P007 GSM5188367
#> 2       P007 GSM5188368
#> 3       P012 GSM5188369
#> 4       P012 GSM5188370
#> 5       P014 GSM5188371
#> 6       P023 GSM5188372

##merge all of them

meta<- left_join(meta1, meta2) %>%
        left_join(meta3)

head(meta)

#> # A tibble: 6 × 7
#>   ind        tissue     disease_state group       treatment sample_id patient_id
#>   <fct>      <chr>      <chr>         <chr>       <chr>     <chr>     <chr>     
#> 1 GSM5188367 blood      TNBC patient  Pre-treatm… anti-PDL… Pre_P007… P007      
#> 2 GSM5188368 lymph_node TNBC patient  Pre-treatm… anti-PDL… Pre_P007… P007      
#> 3 GSM5188369 blood      TNBC patient  Pre-treatm… anti-PDL… Pre_P012… P012      
#> 4 GSM5188370 lymph_node TNBC patient  Pre-treatm… anti-PDL… Pre_P012… P012      
#> 5 GSM5188371 blood      TNBC patient  Pre-treatm… anti-PDL… Pre_P014… P014      
#> 6 GSM5188372 blood      TNBC patient  Pre-treatm… Chemo     Pre_P023… P023

## remove the scATACseq samples
meta<- meta %>%
        filter(!str_detect(sample_id, "ATAC-seq"))

Get the response data

Now, I will need to go to the publication and find the supplementary table which contains the response data. It looks like this:

Everyone should read this https://datacarpentry.org/spreadsheet-ecology-lesson/02-common-mistakes/ and Data organization in spreadsheet

Using excel wisely is not a rocket science, but using it correctly can make bioinformatician’s life much easier.

clinical_data<- readxl::read_xlsx("~/Downloads/1-s2.0-S1535610821004992-mmc2.xlsx", skip = 1)

## there are many places need to be fixed
print(clinical_data, n = 40)

#> # A tibble: 27 × 19
#>    Treatment     `Patient ID` `Biopsied lesi…` `Diameter of b…` `Size of targe…`
#>    <chr>         <chr>        <chr>            <chr>                       <dbl>
#>  1 <NA>          <NA>         <NA>             <NA>                           NA
#>  2 Anti-PD-L1+ … P019         Lymph Node       15                             36
#>  3 <NA>          P010         Lung             21                             35
#>  4 <NA>          P012         Lymph Node       28                             28
#>  5 <NA>          P007         Lymph Node       22                             22
#>  6 <NA>          P017         Lymph Node       16                             45
#>  7 <NA>          P001         -                -                              45
#>  8 <NA>          P002         Chest Wall       48                             48
#>  9 <NA>          P014         -                -                              11
#> 10 <NA>          P004         Chest Wall       35                             35
#> 11 <NA>          P005         Liver            87                             97
#> 12 <NA>          P016         Chest Wall       24                             24
#> 13 Chemo         P022         Breast           33                             48
#> 14 <NA>          P011         -                -                              30
#> 15 <NA>          P020         Breast           37                             55
#> 16 <NA>          P008         Lung             22                             22
#> 17 <NA>          P013         Liver            36                            152
#> 18 <NA>          P025         Breast           26                             26
#> 19 <NA>          P018         Breast           48                             48
#> 20 <NA>          P023         Breast           26                             42
#> 21 <NA>          P024         Breast           72                             95
#> 22 <NA>          P003         Chest Wall       11                             30
#> 23 <NA>          P028         -                26                             57
#> 24 N,Not availa… <NA>         <NA>             <NA>                           NA
#> 25 * Negative v… <NA>         <NA>             <NA>                           NA
#> 26 # PR,partial… <NA>         <NA>             <NA>                           NA
#> 27 $The + sign … <NA>         <NA>             <NA>                           NA
#> # … with 14 more variables: Age <dbl>, `Stage (TNM )` <chr>, `PD-L1` <chr>,
#> #   `TILs (%)` <chr>, Tumor <chr>, ...11 <chr>, ...12 <chr>, Blood <chr>,
#> #   ...14 <chr>, ...15 <chr>,
#> #   `Relative change of target lesions (8 weeks after treatment initiation vs. Pre-treatment)*` <chr>,
#> #   `Relative change of biopsied lesions (8 weeks after treatment initiation vs. Pre-treatment)*` <chr>,
#> #   `Clinical efficacy#` <chr>, `Time to progression (months)$` <chr>

## remove the first row and the last 4 rows, fill in the NAs for the Treatment 
clinical_data<- clinical_data %>%
        janitor::clean_names() %>%
        slice(2:23) %>%
        tidyr::fill(treatment) %>%
        dplyr::select(!(tumor:x15))

yeah, and N for not available , and sometimes - means not available. let’s fix that.

clinical_data<- clinical_data %>%
        #trim off the white spaces. this will help the later step too
        mutate_all(~str_trim(.x, side ="both")) %>%
        mutate_at(vars(-pd_l1), ~str_replace(.x , "^-$", NA_character_)) %>%
        mutate_all( ~str_replace(.x , "^N$", NA_character_))

merge the two tables

meta<- meta %>%
        left_join(clinical_data) %>%
        arrange(patient_id, biopsied_lesion) %>%
        select(-biopsied_lesion)

print(meta, n = 20)

#> # A tibble: 83 × 17
#>    ind        tissue     disease_state group      treatment sample_id patient_id
#>    <fct>      <chr>      <chr>         <chr>      <chr>     <chr>     <chr>     
#>  1 GSM5188381 blood      TNBC patient  Pre-treat… anti-PDL… Pre_P001… P001      
#>  2 GSM5188414 blood      TNBC patient  Post-trea… anti-PDL… Post_P00… P001      
#>  3 GSM5188376 blood      TNBC patient  Pre-treat… anti-PDL… Pre_P002… P002      
#>  4 GSM5188377 chest_wall TNBC patient  Pre-treat… anti-PDL… Pre_P002… P002      
#>  5 GSM5188410 blood      TNBC patient  Post-trea… anti-PDL… Post_P00… P002      
#>  6 GSM5188411 chest_wall TNBC patient  Post-trea… anti-PDL… Post_P00… P002      
#>  7 GSM5188437 blood      TNBC patient  Progressi… anti-PDL… Prog_P00… P002      
#>  8 GSM5188401 chest_wall TNBC patient  Post-trea… Chemo     Post_P00… P003      
#>  9 GSM5585280 blood      TNBC patient  Pre-treat… Anti-PD-… Pre_P004… P004      
#> 10 GSM5585281 chest_wall TNBC patient  Pre-treat… Anti-PD-… Pre_P004… P004      
#> 11 GSM5585282 blood      TNBC patient  Post-trea… Anti-PD-… Post_P00… P004      
#> 12 GSM5585283 blood      TNBC patient  Progressi… Anti-PD-… Prog_P00… P004      
#> 13 GSM5188398 blood      TNBC patient  Pre-treat… anti-PDL… Pre_P005… P005      
#> 14 GSM5188399 liver      TNBC patient  Pre-treat… anti-PDL… Pre_P005… P005      
#> 15 GSM5188430 blood      TNBC patient  Post-trea… anti-PDL… Post_P00… P005      
#> 16 GSM5188431 liver      TNBC patient  Post-trea… anti-PDL… Post_P00… P005      
#> 17 GSM5188367 blood      TNBC patient  Pre-treat… anti-PDL… Pre_P007… P007      
#> 18 GSM5188368 lymph_node TNBC patient  Pre-treat… anti-PDL… Pre_P007… P007      
#> 19 GSM5188402 blood      TNBC patient  Post-trea… anti-PDL… Post_P00… P007      
#> 20 GSM5188433 blood      TNBC patient  Progressi… anti-PDL… Prog_P00… P007      
#> # … with 63 more rows, and 10 more variables:
#> #   diameter_of_biopsied_lesion_mm <chr>, size_of_target_lesion_som_mm <chr>,
#> #   age <chr>, stage_tnm <chr>, pd_l1 <chr>, ti_ls_percent <chr>,
#> #   relative_change_of_target_lesions_8_weeks_after_treatment_initiation_vs_pre_treatment <chr>,
#> #   relative_change_of_biopsied_lesions_8_weeks_after_treatment_initiation_vs_pre_treatment <chr>,
#> #   clinical_efficacy_number <chr>, time_to_progression_months <chr>

That’s how much it takes to get the sample level metadata from a public dataset. It is hard to fully automate and one has to go to GEO and supplementary tables, and wrangle the data to a desired format. If you ask me, I will tell you this is the real-life bioinformatics.

Happy BioinFORMATics

scRNAseq

My odyssey of obtaining scRNAseq metadata

ffq

get the metadata using GEOquery

Get the response data

Related