M5

M5 GSC project description 2012

Project Title: M5 — A meta-infrastructure enabling exchange of large (metagen)omics data sets

Project Leads:

Folker Meyer, folker@anl.gov, Argonne National Laboratory
Rob Finn, rdf@ebi.ac.uk, EBI

Team members:

Nikos Kyrpides, NCKyrpides@lbl.gov, DOE JGI
Andreas Wilke, wilke@mcs.anl.gov, Argonne National Laboratory
Konstantinos Mavrommatis, KMavrommatis@lbl.gov, DOE JGI (left JGI)
Jeff Grethe, jgrethe@ncmir.ucsd.edu, UCSD
Folker Meyer, folker@anl.gov, Argonne National Laboratory
Sarah Hunter, hunter@ebi.ac.uk, EBI
Dawn Field (CEH)

Elevator Pitch Large ‘omics data sets are changing the way we do business in fields such as metagenomics. Computational analyses are rate limiting: computational costs now dwarf sequencing costs for several types of experiments. The M5 project aims to provide critical pieces of infrastructure that enable sharing not just of raw data but also of derived data products.

Project Summary Large quantities of data in ever-growing data sets pose significant infrastructure challenges to biologists and bioinformaticians. The old, very loosely integrated approaches that rely on the INSDC network for sequence data sharing are still important. However, additional standards-driven layers of data infrastructure will emerge over time, driven simply by the cost of data analysis.
Reviewing scientific papers on shotgun metagenomic data sets is already problematic because the cost of computational re-analysis is significant.

Only by sharing derived results in robust ways can the community overcome the computational burden. Put simply, minimizing the number of times a particular data set undergoes a specific analysis maximizes the number of analyses the community as a whole can perform. Technology, standards and community buy-in are required, and the group is working on creating the missing pieces of a more complete data sharing ecosystem.

Project initiation date M5 was started at GSC in Stockholm during a late night discussion between several GSC members. The notion of creating exchangeable data products is becoming more and more important as data set sizes and numbers grow.
So far, the M5 project has produced a unified reference database (M5nr) that allows sharing of data mapped to the database and circumvents problems with specific in-house database solutions.

What will this project aim to contribute to the GSC? The sharing of computed results will be a requirement for future biology. M5 aims to provide the standards, technologies and community buy-in required for that to function.

Have you spoken about the project already within GSC? Several times at GSC meetings and during the call.

Which existing projects, if any, does this one replace/complement/subsume/expand? Explain briefly why an extra project is needed/justified N/A

How does this project fit into GSC’s mission statement ? Driven by changing technology, more and more data sets are being created, and the bottleneck is moving from data set creation to data set analysis. Computational bottlenecks are found everywhere. Creating exchangeable data sets allows analyses to be re-used, which permits massive cost savings. An example is the M5nr, which allows the exchange of annotations for metagenomes in an abstract format that can be mapped into several namespaces. This removes the need to re-BLAST data sets for use with another analysis framework that requires a specific set of annotations.

The goal of M5 is to establish a data and analysis sharing infrastructure, initially for shotgun metagenomic data (an area of rapid growth), building tools and standards that can be applied to other data types and types of experiments.
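The M5nr example above can be illustrated with a minimal sketch. It assumes, as a simplification, that the abstract identifier is an MD5 checksum of the protein sequence. The sequences, namespace names (`ns_one`, `ns_two`), annotation labels and the helper functions `sequence_id` and `translate` are all hypothetical and are not taken from the M5nr tools. The point is only that search hits stored against abstract identifiers can be viewed in a different namespace by a table lookup, without repeating the similarity search.

```python
import hashlib


def sequence_id(protein_sequence: str) -> str:
    """Return an identifier derived only from the sequence content."""
    # Strip whitespace and normalize case so formatting does not change the ID.
    normalized = "".join(protein_sequence.split()).upper()
    return hashlib.md5(normalized.encode("ascii")).hexdigest()


# Made-up sequences and labels, for illustration only.
SEQ_A = "MKTAYIAKQRQISFVKSHFSRQ"
SEQ_B = "MSDNELLKQAVEALRK"

# Hypothetical mapping: one abstract ID -> annotations in several namespaces.
NAMESPACE_MAP = {
    sequence_id(SEQ_A): {"ns_one": "example function A", "ns_two": "EX:0001"},
    sequence_id(SEQ_B): {"ns_one": "example function B"},
}


def translate(hit_ids, namespace):
    """Map stored hit identifiers into one namespace without re-running the search."""
    return {h: NAMESPACE_MAP.get(h, {}).get(namespace) for h in hit_ids}


# Hits computed once and stored as abstract identifiers.
stored_hits = [sequence_id(SEQ_A), sequence_id(SEQ_B)]

# The same stored hits viewed in two different namespaces.
view_one = translate(stored_hits, "ns_one")
view_two = translate(stored_hits, "ns_two")
```

In this sketch an identifier with no annotation in the requested namespace maps to `None`, so a consumer can see which hits cannot be expressed in its preferred namespace.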

Will you start a GSC working group (how far along are you?)? If not, why not (i.e. subgroup within developers group, existing external community, etc). There is an M5 working group that meets at irregular intervals.

How do you wish to further engage the GSC(recruit members to project, get consultation, link to other GSC projects, etc)? N/A

Do you already have a website or do you wish to create a home page for the project in the GSC website ? Yes.

What other resources might you like from what the GSC can offer (mailing lists, etc) ? N/A

What kind of timeline are you working to for building consensus, releasing a first version etc ? M5 will need to bring together several other pieces of data sharing infrastructure.

How is this work currently funded (list grants, funders, in kind contributions, etc)? There was a small DOE grant to Nikos Kyrpides and Folker Meyer. This grant has ended. Currently M5 is unfunded, but work in DOE KBASE will be

What resources will be required for completion (funding, man power, etc.)? Creating exchangeable data formats and defining semantics that allow exchange is a long-term project. While funding is desirable, the work is expected to continue through peering among existing analysis providers (e.g. MG-RAST, DOE KBASE, DOE JGI, EBI Metagenomics portal, CAMERA, etc).

What are your current plans for publishing/promoting the project? The plan is to continue a series of publications of data products and standards.

References or relevant websites (for further reading)
A. Wilke, T. Harrison, J. Wilkening, D. Field, E. M. Glass, N. Kyrpides, K. Mavrommatis and F. Meyer, The M5nr: a novel non-redundant database containing protein sequences and annotations from multiple sources and associated tools, BMC Bioinformatics, 2012, http://www.biomedcentral.com/1471-2105/13/141
