Making Big Data work in genetics

Laura Clarke and colleagues report on the data access and management practices of the 1000 Genomes Project [1].

The larger data volumes and shorter read lengths of high-throughput sequencing technologies created substantial new requirements for bioinformatics, analysis and data-distribution methods. The initial plan for the 1000 Genomes Project was to collect 2× whole genome coverage for 1,000 individuals, representing ~6 giga–base pairs of sequence per individual and ~6 tera–base pairs (Tbp) of sequence in total. Increasing sequencing capacity led to repeated revisions of these plans to the current project scale of collecting low-coverage, ~4× whole-genome and ~20× whole-exome sequence for ~2,500 individuals plus high-coverage, ~40× whole-genome sequence for 500 individuals in total (~25-fold increase in sequence generation over original estimates). In fact, the 1000 Genomes Pilot Project collected 5 Tbp of sequence data, resulting in 38,000 files and over 12 terabytes of data being available to the community. In March 2012 the still-growing project resources include more than 260 terabytes of data in more than 250,000 publicly accessible files.
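The figures for the original plan can be checked with a little arithmetic. The sketch below assumes a haploid human genome of roughly 3 giga-base pairs, a number the text does not state; the function name `sequenced_bases` is made up for illustration.

```python
# Assumption (not from the text): haploid human genome of roughly 3 Gbp.
GENOME_SIZE_BP = 3e9


def sequenced_bases(coverage, individuals=1, genome_size_bp=GENOME_SIZE_BP):
    """Total bases sequenced at a given average coverage depth."""
    return coverage * genome_size_bp * individuals


# Original plan as described in the text: 2x coverage, 1,000 individuals.
per_individual_bp = sequenced_bases(coverage=2)
total_bp = sequenced_bases(coverage=2, individuals=1000)

print(f"per individual: {per_individual_bp / 1e9:.0f} Gbp")
print(f"project total:  {total_bp / 1e12:.0f} Tbp")
```

Under that assumption the numbers come out at about 6 Gbp per individual and 6 Tbp overall, in line with the original estimate.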

The paper acknowledges that this large-scale genetic sequencing project nevertheless generates far less data than physics and astronomy projects. The Large Synoptic Survey Telescope, for example, will generate 20 terabytes each night of operation, while the Large Hadron Collider will generate roughly 15 petabytes per year. The 1000 Genomes Project data to date add up to around two weeks of LSST operation. Still, it's not hard to see how high-coverage sequencing will start to catch up in data storage and transfer requirements.
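The comparison is easy to reproduce from the numbers quoted above. This is a rough sketch only: it assumes 1 petabyte = 1,000 terabytes and that the Large Hadron Collider's output is spread evenly over 365 days, neither of which comes from the paper, and the function names are illustrative.

```python
PROJECT_TB = 260         # 1000 Genomes resources as of March 2012 (from the text)
LSST_TB_PER_NIGHT = 20   # from the text
LHC_PB_PER_YEAR = 15     # from the text


def nights_of_lsst(volume_tb, tb_per_night=LSST_TB_PER_NIGHT):
    """How many nights of LSST operation produce this much data."""
    return volume_tb / tb_per_night


def days_of_lhc(volume_tb, pb_per_year=LHC_PB_PER_YEAR):
    """How many days of LHC operation produce this much data.

    Assumptions: 1 PB = 1000 TB, output spread evenly over 365 days.
    """
    tb_per_day = pb_per_year * 1000 / 365
    return volume_tb / tb_per_day


print(f"LSST: {nights_of_lsst(PROJECT_TB):.0f} nights")
print(f"LHC:  {days_of_lhc(PROJECT_TB):.1f} days")
```

Thirteen nights of LSST operation is where "around two weeks" comes from; by the same rough reckoning, the LHC would produce the equivalent in under a week.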

We are now in a golden age of data centralization. But five years from now, we may return to a second era of disposable data, as gene expression and whole-genome resequencing studies will generate far more data than any central repository can store. We will need curation practices to identify and preserve data that have value beyond the project for which they were collected.

The beautiful thing about this is that when data are abundant, they don't all have to work together. There is a real role for a new generation of curators to facilitate the mashups of the future.

References

1. Clarke L, Zheng-Bradley X, Smith R, Kulesha E, Xiao C, Toneva I, Vaughan B, Preuss D, Leinonen R, Shumway M, et al. 2012. The 1000 Genomes Project: data management and community access. Nature Methods 9:459-462.

Tags: data access, archiving, curation, genetics, gene expression
