A review article in the journal Frontiers in Bioinformatics highlights the challenges faced and accomplishments made by the National Microbiome Data Collaborative (NMDC) in improving the ability to both find and access comparable multiomic data. Multiomic data for microbial communities comes from characterizing all the DNA, RNA, proteins and metabolites in a system. Together, these describe the genetic content in a microbiome plus the nuances of how and when genetic material is used: specifically, which transcripts and proteins respond to what stimuli and what metabolites are consumed and made. Apply these to entire microbial communities, and you have microbiome multiomics.
The world’s microbiomes are still largely unknown frontiers: microbes in the soil, ocean and human body play important roles in their respective ecosystems, but scientists still only “know” a few of them. By pooling quality, comparable data in the NMDC, scientists hope to begin to elucidate some of these unknown microbes and hypothesize their functions.
“This availability of data could enable broader and larger research projects,” said Patrick Chain, NMDC co-principal investigator. “Scientists looking to more comprehensively compare microbiome data, or to compare their own omics data to similar studies from comparable environments, would now be able to find these samples using the NMDC. They would then be able to use NMDC bioinformatics workflows to compare their own data to those processed in identical fashion within the NMDC.”
The NMDC was established in 2019 by the Department of Energy to ensure that data is findable, accessible, interoperable and reusable (the FAIR principles). Reliable data has been lacking, so the NMDC set out to develop standards for cataloguing, processing, analysis and documentation. These standards cover which bioinformatics workflows to use to process the data and what terminology to use for sample metadata and for documenting the use of the workflows themselves. Metadata, for example, is especially useful in giving context to a sample.
“While current public repositories contain many datasets from samples collected from water sources, we need to know more,” said Los Alamos computational biologist Bin Hu. “Was it sea water? How deep was the sample collected in the water column? What time of day was it collected? What temperature? These bits of information about the sample provide greater detail and nuance that will help us better understand the results.”
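As a rough illustration of the kind of context Hu describes, a sample's metadata can be pictured as a small structured record. The sketch below is hypothetical: the class name, field names and values are invented for illustration and are not the NMDC's actual metadata schema.

```python
from dataclasses import dataclass, fields
from typing import List, Optional


@dataclass
class WaterSampleMetadata:
    """Hypothetical record; field names are illustrative, not the NMDC schema."""
    sample_id: str
    water_type: Optional[str] = None       # e.g. "sea water"
    depth_m: Optional[float] = None        # depth in the water column, in meters
    collected_at: Optional[str] = None     # ISO 8601 timestamp of collection
    temperature_c: Optional[float] = None  # water temperature, in degrees Celsius


def missing_context(sample: WaterSampleMetadata) -> List[str]:
    """Return the names of the fields that were left unset."""
    return [f.name for f in fields(sample) if getattr(sample, f.name) is None]


# A record that says only "a water sample" versus one that carries context.
sparse = WaterSampleMetadata(sample_id="sample-001")
rich = WaterSampleMetadata(
    sample_id="sample-002",
    water_type="sea water",
    depth_m=50.0,
    collected_at="2021-06-01T14:30:00Z",
    temperature_c=12.5,
)

print(missing_context(sparse))
print(missing_context(rich))
```

The first record leaves all four context questions unanswered, while the second answers each of them, which is the difference between a dataset that is merely stored and one that can be meaningfully compared with others.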
The core partners in the NMDC (including the Lawrence Berkeley Laboratory and Pacific Northwest National Laboratory) have already populated the catalog with hundreds of high-quality samples for use by the larger scientific community. The ultimate goal, however, is to enable community members to enrich the catalog further by contributing their own data, coupled with metadata and processed with the standardized NMDC workflows.
Hu, Chain and their colleagues at Los Alamos, Berkeley and Pacific Northwest laboratories have been leading the NMDC efforts in standardized workflow development and data processing. The Los Alamos team has also re-tooled their award-winning online bioinformatics package EDGE into a version suited for NMDC microbiome data (NMDC-EDGE) and has been instrumental in developing training and outreach material to help users in the community.
Paper: Challenges in Bioinformatics Workflows for Processing Microbiome Omics Data at Scale, Frontiers in Bioinformatics, 17 January 2022 | https://doi.org/10.3389/fbinf.2021.826370
Funding: The U.S. Department of Energy, Office of Science, Office of Biological and Environmental Research (BER).