GEO

Obtain metadata for public datasets in GEO

There are so many public datasets out there waiting for us to mine! It is both a blessing and a curse for a computational biologist. Metadata, the data describing the data (e.g., responder or non-responder to the treatment), are critical for interpreting an analysis. Without metadata, your data are useless. People usually go to GEO or ENA to download public data. I asked on Twitter how to get the metadata, and I will show you the approaches suggested by all the awesome tweeps.
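One route that needs no extra package is to read the series matrix file straight from the GEO FTP site. Below is a minimal sketch, not necessarily one of the approaches suggested on Twitter. It assumes the usual GEO FTP layout, where series are grouped into directories named by replacing the last three digits of the accession with "nnn". The function name geo_matrix_url and the accession GSE12345 are placeholders made up for illustration, and series that span more than one platform may use a different file name.

```shell
#!/usr/bin/env bash
# Build the URL of the series matrix file for a GEO series accession.
# Assumption: series live under .../geo/series/<stub>/<accession>/matrix/,
# where <stub> is the accession with its last three digits replaced by "nnn"
# (accessions with three or fewer digits go under "GSEnnn").
geo_matrix_url() {
    local acc="$1"              # e.g. GSE12345 (placeholder accession)
    local digits="${acc#GSE}"   # strip the "GSE" prefix
    local stub
    if [ "${#digits}" -gt 3 ]; then
        stub="GSE${digits%???}nnn"   # drop the last three digits
    else
        stub="GSEnnn"
    fi
    printf 'https://ftp.ncbi.nlm.nih.gov/geo/series/%s/%s/matrix/%s_series_matrix.txt.gz\n' \
        "$stub" "$acc" "$acc"
}

# Usage (needs network access): sample-level metadata lines in the
# series matrix start with "!Sample_".
# curl -sL "$(geo_matrix_url GSE12345)" | gunzip -c | grep '^!Sample_'
```

Keeping the URL construction separate from the download makes it easy to check the path by eye before fetching anything.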

How to upload files to GEO

Reading links:
http://yeolab.github.io/onboarding/geo.html
http://www.hildeschjerven.net/Protocols/Submission_of_HighSeq_data_to_GEO.pdf
https://www.ncbi.nlm.nih.gov/geo/info/submissionftp.html

1. create account

Go to NCBI GEO: http://www.ncbi.nlm.nih.gov/geo/
Create a User ID and password. My username is research_guru; I used my Google account.

2. fill in the xls sheet

Download the metadata xls sheet from https://www.ncbi.nlm.nih.gov/geo/info/seq.html

bgzip the fastqs

cd 01seq
find *fastq | parallel bgzip
md5sum *fastq.gz > fastq_md5.txt

# copy to excel
cat fastq_md5.txt | awk '{print $2}'

# copy to excel
cat fastq_md5.
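The checksum steps above can also be wrapped into one helper that prints the file name and its md5 side by side, so both columns can be pasted into the spreadsheet in one go. This is only a sketch: the function name md5_table is made up for illustration, it assumes GNU md5sum is available, and it assumes the file names contain no spaces.

```shell
#!/usr/bin/env bash
# Print "file name <TAB> md5" for every compressed FASTQ in a directory.
# Assumptions: GNU md5sum is installed and file names contain no whitespace.
md5_table() {
    local dir="${1:-.}"   # directory holding the *fastq.gz files (default: current)
    # md5sum prints "<md5>  <file>"; awk swaps the two columns.
    ( cd "$dir" && md5sum -- *fastq.gz | awk '{print $2 "\t" $1}' )
}

# Usage: md5_table 01seq > fastq_md5.tsv
```

Running it in a subshell keeps the caller's working directory unchanged.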