OSDR-Data Processing - Metagenomics

Metagenomics
README
Raw sequence data and QC
Filtered sequence data and QC
Assembly-based processing
Assemblies
Predicted-genes
Read-mapping
Annotations-and-taxonomy
Combined-outputs
Bins
MAGs
Read-based processing
Processing info

Metagenomics

Image shows the Amplicon sequencing methodology.

Image shows the Metagenomics sequencing methodology.

Metagenomics and amplicon sequencing are two commonly used approaches in microbial ecology. The images above summarize the metagenomics and amplicon sequencing methodologies.

Metagenomics, or shotgun metagenomic sequencing, aims to sequence all the accessible DNA of a mixed community. Subsequent analysis of metagenomic sequence data provides a window into the taxonomy and functional potential of a sample. Metagenomics also enables the potential recovery of Metagenome-Assembled Genomes (MAGs). A MAG typically differs from an isolate genome: it is more like a reconstructed representative genome of a population of very closely related microbes that were present in the sampled system, and it may match an existing isolate or represent a novel organism. The recovery of MAGs has become a powerful approach in microbial ecology, drastically expanding the known Tree of Life by granting genomic access to as-yet unculturable microbes (e.g., Hug et al. 2016; Parks et al. 2017).

Some common uses of metagenomics data include:

providing information on taxonomy and functional potential within a mixed community
tracking changes in taxonomy and functional potential within a mixed community in response to a treatment and/or across environmental gradients or time
enabling insight into the genomes of as-yet unculturable and/or undescribed microbes

Image shows the Metagenomics sequencing data processing pipeline.

Additional information on metagenomics (and amplicon sequencing) can be found at Happy Belly Bioinformatics. The GeneLab metagenomics consensus processing pipeline is designed to identify taxonomic changes and functional potential in biological samples exposed to the space environment by processing raw shotgun metagenomic sequence data through contig-level taxonomic assignments, functional annotation, and recovery of MAGs as summarized in the diagram above.

Specific details of each step of the pipeline, including previous and current pipeline versions, are available on the Metagenomics page of the GeneLab DP GitHub Repository.

The primary data products and respective quality control (QC) analyses generated from each step of the pipeline, as described below, are available for each GeneLab metagenomics dataset hosted on the OSDR under ‘Study Files’. Note that all human reads are scrubbed from raw metagenomic sequence data prior to publishing on the OSDR. Below are the metagenomics data files that are published:

README

*README.txt: A text file describing the output directory structure and data files

Raw sequence data and QC

*HRremoved_raw.fastq.gz: Raw sequence data after human read removal, commonly referred to as raw reads
*raw_multiqc_report.zip: Combined FastQC analyses and respective HTML report of raw sequence data

Filtered sequence data and QC

*filtered.fastq.gz: Trimmed and quality-filtered sequence data, commonly referred to as filtered reads
*filtered_multiqc_report.zip: Combined FastQC analyses and respective HTML report of trimmed and filtered sequence data

Assembly-based processing

*Assembly-based-processing-overview.tsv: Table containing information about samples throughout the assembly-based processing

Assemblies

*assembly.fasta: Contig-renamed assembled sequences
*assembly-summaries.tsv: Table containing assembly summary statistics

Predicted-genes

*genes.fasta: Recovered gene nucleotide sequences
*genes.faa: Recovered gene amino acid sequences
*genes.gff: Recovered gene calls in general feature format
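
As an illustration of working with the gene calls, the following is a minimal sketch of reading a general feature format file in Python. It assumes the standard nine tab-separated GFF columns and `key=value` attributes separated by semicolons; the `GeneCall` type, the `parse_gff` function, and the example record (contig name, tool name, coordinates) are hypothetical and are not taken from the pipeline.

```python
from typing import Dict, Iterable, Iterator, NamedTuple


class GeneCall(NamedTuple):
    """Subset of the fields in one GFF feature line (hypothetical helper type)."""
    contig: str
    start: int
    end: int
    strand: str
    attributes: Dict[str, str]


def parse_gff(lines: Iterable[str]) -> Iterator[GeneCall]:
    """Yield one GeneCall per feature line, skipping comments and blank lines."""
    for line in lines:
        line = line.rstrip("\n")
        if not line or line.startswith("#"):
            continue
        fields = line.split("\t")
        if len(fields) != 9:
            # Not a standard nine-column feature line; skip it.
            continue
        seqid, _source, _type, start, end, _score, strand, _phase, attrs = fields
        # Attributes are assumed to look like "ID=1_1;partial=10;".
        attributes = dict(
            pair.split("=", 1)
            for pair in attrs.strip(";").split(";")
            if "=" in pair
        )
        yield GeneCall(seqid, int(start), int(end), strand, attributes)


# Made-up record for illustration only.
example = [
    "##gff-version 3",
    "contig_1\texample_tool\tCDS\t3\t98\t.\t-\t0\tID=1_1;partial=10;",
]
genes = list(parse_gff(example))
gene_length = genes[0].end - genes[0].start + 1  # GFF coordinates are 1-based, inclusive
```

With a real file, the same function can be fed an open file handle, for example `parse_gff(open(path))`.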

Read-mapping

*mapping-info.txt: Text file containing bowtie2 alignment info/stats
*metabat-assembly-depth.tsv: Table containing quantification of contig coverage for each sample
*.bam: Binary alignment map (BAM) file containing reads mapped to the respective sample's *assembly.fasta reference, sorted by coordinate

Annotations-and-taxonomy

*contig-coverage-and-tax.tsv: Table containing combined contig coverage and taxonomy info for each sample
*gene-coverage-annotation-and-tax.tsv: Table containing combined gene coverage, annotation, and taxonomy info for each sample

Combined-outputs

*Combined-contig-level-taxonomy-coverages.tsv: Table containing unnormalized coverage quantification of contigs across all samples, summed based on contig-level assigned taxonomy
*Combined-contig-level-taxonomy-coverages-CPM.tsv: Table containing coverage quantification of contigs across all samples, summed based on contig-level assigned taxonomy, normalized as coverage per million (CPM)
*Combined-gene-level-taxonomy-coverages.tsv: Table containing unnormalized coverage quantification of genes across all samples, summed based on gene-level taxonomic classifications
*Combined-gene-level-taxonomy-coverages-CPM.tsv: Table containing coverage quantification of genes across all samples, summed based on gene-level assigned taxonomy, normalized as coverage per million (CPM)
*Combined-gene-level-KO-function-coverages.tsv: Table containing unnormalized coverage quantification of Kyoto Encyclopedia of Genes and Genomes (KEGG) Orthology (KO) annotations across all samples, summed by KO identifier
*Combined-gene-level-KO-function-coverages-CPM.tsv: Table containing coverage quantification of KO annotations across all samples, summed by KO identifier, and normalized as coverage per million (CPM)
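
The relationship between the unnormalized and CPM tables above can be sketched as follows. This is a minimal illustration that assumes CPM means scaling each sample's coverages so that they sum to one million; the exact normalization applied by the pipeline is documented in the GeneLab repository and may differ in detail. The `to_cpm` function, sample names, and values are hypothetical.

```python
from typing import Dict


def to_cpm(coverages: Dict[str, Dict[str, float]]) -> Dict[str, Dict[str, float]]:
    """Scale each sample's coverages so they sum to 1,000,000.

    `coverages` maps sample name -> {feature identifier: unnormalized coverage}.
    Assumption: CPM is computed per sample, independently of other samples.
    """
    result: Dict[str, Dict[str, float]] = {}
    for sample, values in coverages.items():
        total = sum(values.values())
        if total == 0:
            # Avoid division by zero for samples with no coverage at all.
            result[sample] = {feature: 0.0 for feature in values}
        else:
            result[sample] = {
                feature: value / total * 1_000_000
                for feature, value in values.items()
            }
    return result


# Made-up KO coverages for two hypothetical samples.
raw = {
    "sample_A": {"K00001": 30.0, "K00002": 10.0},
    "sample_B": {"K00001": 0.0, "K00002": 5.0},
}
cpm = to_cpm(raw)
```

Normalizing per sample in this way makes values comparable between samples that differ in sequencing depth, which is why both unnormalized and CPM versions of each table are useful.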

Bins

*bin.*.fasta: Recovered assembled metagenomic contig sequences grouped into bins believed to represent individual microbial populations

MAGs

*MAG.*.fasta: High-quality MAGs recovered, if any, having greater than 90% estimated completeness and less than 10% estimated redundancy
*MAGs-overview.tsv: Table containing information about each recovered MAG including general stats, quality estimates, and assigned taxonomy
*MAG-level-KO-annotations.tsv: Table containing MAGs and their respective KO annotations
*MAG-KEGG-Decoder-out.tsv: Table containing MAGs and their respective proportions of genes known to be required for specific pathways/metabolisms
*MAG-KEGG-Decoder-out.html: Interactive heatmap html file representing the proportions of genes known to be required for specific pathways/metabolisms in recovered MAGs
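
The quality thresholds stated above for high-quality MAGs can be expressed as a simple filter. The sketch below assumes completeness and redundancy estimates are given as percentages; the `is_high_quality` function, bin names, and values are hypothetical, and the way the pipeline derives the estimates is not shown.

```python
def is_high_quality(completeness: float, redundancy: float) -> bool:
    """Apply the stated cutoffs: >90% estimated completeness, <10% estimated redundancy."""
    return completeness > 90.0 and redundancy < 10.0


# Made-up quality estimates for three hypothetical bins.
bins = [
    {"name": "bin.1", "completeness": 97.2, "redundancy": 1.4},
    {"name": "bin.2", "completeness": 88.0, "redundancy": 2.0},
    {"name": "bin.3", "completeness": 95.0, "redundancy": 12.5},
]

mags = [
    b["name"]
    for b in bins
    if is_high_quality(b["completeness"], b["redundancy"])
]
```

Both comparisons are strict, matching the "greater than" and "less than" wording, so a bin at exactly 90% completeness would not qualify under this sketch.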

Read-based processing

*Gene-families.tsv: Table containing unnormalized gene-family quantification for all samples
*Gene-families-grouped-by-taxa.tsv: Table containing unnormalized gene-family quantification grouped by taxa for all samples
*Gene-families-cpm.tsv: Table containing gene-family quantification normalized as copies-per-million (CPM) for all samples
*Gene-families-KO-cpm.tsv: Table containing KO term quantification normalized as copies-per-million (CPM) for all samples
*Pathway-abundances.tsv: Table containing unnormalized pathway quantification for all samples
*Pathway-abundances-grouped-by-taxa.tsv: Table containing unnormalized pathway quantification grouped by taxa for all samples
*Pathway-abundances-cpm.tsv: Table containing pathway quantification normalized as copies-per-million (CPM) for all samples
*Pathway-coverages.tsv: Table containing pathway coverages for all samples
*Pathway-coverages-grouped-by-taxa.tsv: Table containing pathway coverages grouped by taxa for all samples
*Metaphlan-taxonomy.tsv: Table containing MetaPhlAn-estimated taxonomic relative abundances for all samples

Processing info

*processing_info.tar: Tar archive containing the exact processing information for each step of the pipeline
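
The glob-style patterns listed above can be used to sort downloaded files by category. The following is a minimal sketch using Python's standard `fnmatch` module with a subset of the patterns; the `categorize` function and the example file names are hypothetical, and actual file names on the OSDR may carry additional prefixes.

```python
from fnmatch import fnmatchcase
from typing import Optional

# (category, pattern) pairs copied from a subset of the listing above.
PATTERNS = [
    ("Raw sequence data and QC", "*HRremoved_raw.fastq.gz"),
    ("Raw sequence data and QC", "*raw_multiqc_report.zip"),
    ("Filtered sequence data and QC", "*filtered.fastq.gz"),
    ("Assemblies", "*assembly.fasta"),
    ("Predicted-genes", "*genes.gff"),
    ("Read-mapping", "*.bam"),
    ("Bins", "*bin.*.fasta"),
    ("MAGs", "*MAG.*.fasta"),
]


def categorize(filename: str) -> Optional[str]:
    """Return the first category whose pattern matches, or None if nothing matches."""
    for category, pattern in PATTERNS:
        # fnmatchcase is case-sensitive on every platform.
        if fnmatchcase(filename, pattern):
            return category
    return None


# Made-up file names for illustration only.
examples = [
    "Sample-1_R1_HRremoved_raw.fastq.gz",
    "Sample-1-assembly.fasta",
    "Sample-1-bin.3.fasta",
    "Sample-1-MAG.1.fasta",
    "notes.txt",
]
categories = {name: categorize(name) for name in examples}
```

Case-sensitive matching is used so that results do not depend on the operating system's file name conventions.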

