sanger-tol/metagenomeassembly

You are viewing the development version pages for this pipeline. The latest stable release is v1.1.1

https://github.com/sanger-tol/metagenomeassembly

Introduction

This document describes the output produced by the pipeline.

The directories listed below will be created in the results directory after the pipeline has finished. All paths are relative to the top-level results directory.

Pipeline overview

The pipeline is built using Nextflow and processes data using the following steps:

- Assembly: metagenomic assembly of raw PacBio HiFi reads.
- Assembly QC: QC of metagenome assemblies, including statistics and rRNA identification.
- Read mapping: mapping of PacBio HiFi reads and Illumina Hi-C reads to the assembly for coverage estimation and contact map generation.
- Binning: binning of total metagenome assemblies into genome bins.
- Bin refinement: refinement of genome bins by assessing single-copy gene content.
- Bin QC: QC of genome bins, including basic statistics, rRNA content assessment, and tRNA annotation.
- Bin taxonomy: taxonomic classification of bins with GTDB-Tk and conversion of these classifications to NCBI names.
- Bin summary: summarising key information into a final table, then scoring and classifying bins into quality categories according to completeness, contamination, tRNA and rRNA content.
- Pipeline information: report metrics generated during the workflow execution.

Assembly

Assembly of raw input HiFi reads.

metaMDBG

metaMDBG is a metagenome assembler for long read (PacBio HiFi and ONT) data.

Output files

assembly/
- fasta/[sampleid]_metamdbg.contigs.fasta.gz: the output assembled contigs.
- log/[sampleid]_metamdbg.metaMDBG.log: log file detailing metaMDBG assembly process.
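
As a rough illustration of working with the contig file listed above, the following sketch builds the documented path for one sample and tallies contig lengths from the gzipped FASTA. The function names `assembly_fasta` and `contig_lengths` are hypothetical helpers, not part of the pipeline, and the sketch assumes a standard FASTA layout with `>` header lines.

```python
import gzip
from pathlib import Path


def assembly_fasta(results_dir, sample_id):
    """Build the documented path to the metaMDBG contigs for one sample."""
    return Path(results_dir) / "assembly" / "fasta" / f"{sample_id}_metamdbg.contigs.fasta.gz"


def contig_lengths(fasta_gz):
    """Return {contig_name: length} for a gzipped FASTA file."""
    lengths = {}
    name = None
    with gzip.open(fasta_gz, "rt") as handle:
        for line in handle:
            line = line.strip()
            if not line:
                continue
            if line.startswith(">"):
                # Use the first whitespace-delimited token of the header as the name.
                tokens = line[1:].split()
                name = tokens[0] if tokens else ""
                lengths[name] = 0
            elif name is not None:
                # Sequence lines may be wrapped, so accumulate across lines.
                lengths[name] += len(line)
    return lengths
```

Calling `contig_lengths(assembly_fasta("results", "my_sample"))` would then give a per-contig length table, where `results` and `my_sample` are placeholder values.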

Assembly QC

Assembly statistics (contig counts, lengths, N50, etc.) are calculated using SeqKit. The number of circular contigs is also reported, and ribosomal RNA genes are annotated using Infernal.

Output files

assembly/qc/
- [sampleid]_[assembler].stats.tsv: TSV of assembly statistics.
- [sampleid]_[assembler].rrna.tbl: TSV of rRNA annotations per contig.
- [sampleid]_[assembler].circles_plasmid_summary.tsv: TSV summarising contig-level classifications as plasmids from geNomad.
- [sampleid]_[assembler].circles_virus_summary.tsv: TSV summarising contig-level classifications as viruses from geNomad.
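
For readers unfamiliar with N50, one of the statistics mentioned above, the sketch below computes it using the conventional definition: the length L such that contigs of length L or longer contain at least half of all assembled bases. This is an illustrative implementation only; the function name `n50` is hypothetical, and SeqKit's own calculation may differ in edge cases.

```python
def n50(lengths):
    """Return the N50 of a collection of contig lengths (0 if empty)."""
    ordered = sorted(lengths, reverse=True)
    total = sum(ordered)
    running = 0
    for length in ordered:
        running += length
        # Stop at the first contig where the running sum reaches half the total.
        if running * 2 >= total:
            return length
    return 0
```

For example, with contig lengths 2 to 10 (total 54), the three longest contigs (10, 9, 8) sum to 27, which is half of the total, so the N50 is 8.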

Read mapping

Mapping of HiFi reads to the assembly using minimap2, and Hi-C reads to the assembly using bwa-mem2. Mean coverage estimation of contigs using CoverM.

Output files

assembly/mapping/
- [sampleid]_[assembler].minimap2.hifi.bam: Alignment BAM of HiFi reads to the assembly.
- [sampleid]_[assembler].minimap2.hifi.depth.txt: TSV of per-contig mean coverages estimated using CoverM.
- [sampleid]_[assembler].bwa-mem2.hic.bam: Alignment BAM of Hi-C reads to the assembly.

Binning

Binning of assembled contigs using MetaBAT2, MaxBin2, bin3C (Hi-C binning), and MetaTOR (Hi-C binning).

Output files

bins/
- fasta/[binner]/*.f(n|ast)a.gz: Bins in gzipped fasta format output by the given binner.
- log/[binner]/*: Log files and other output from each binner.
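
The file pattern `*.f(n|ast)a.gz` above covers names ending in `.fna.gz` or `.fasta.gz`. As a minimal sketch of how the bins could be collected per binner, the following walks `bins/fasta/` and groups matching filenames by directory. The function name `bins_by_binner` is a hypothetical helper, and the directory layout is assumed to be exactly as documented above.

```python
import re
from collections import defaultdict
from pathlib import Path

# Regex equivalent of the documented suffix pattern .f(n|ast)a.gz
BIN_SUFFIX = re.compile(r"\.f(n|ast)a\.gz$")


def bins_by_binner(results_dir):
    """Map each binner directory name to a sorted list of its bin FASTA filenames."""
    fasta_root = Path(results_dir) / "bins" / "fasta"
    if not fasta_root.is_dir():
        return {}
    found = defaultdict(list)
    for binner_dir in sorted(fasta_root.iterdir()):
        if not binner_dir.is_dir():
            continue
        for path in sorted(binner_dir.iterdir()):
            # Keep only files whose names end in .fna.gz or .fasta.gz.
            if BIN_SUFFIX.search(path.name):
                found[binner_dir.name].append(path.name)
    return dict(found)
```

Binner directories that contain no matching files are omitted from the result.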

Bin refinement

Refinement of genome bins using DAS_Tool and MAGScoT.

Output files

bins/
- fasta/[binner]/*.f(n|ast)a.gz: Bins in gzipped fasta format output by the given binner.
- log/[binner]/*: Log files and other output from each binner.

Bin QC

QC of genome bins, including summary statistics using SeqKit, completeness/contamination assessment using CheckM2, rRNA identification using the assembly rRNA annotations, and tRNA annotation using tRNAscan-SE.

Output files

bins/
- qc/[sampleid]-[assembler]-[binner].stats.tsv: TSV of summary statistics for the bins.
- qc/[sampleid]-checkm2.tsv: TSV of single-copy-gene checking results for all bins from CheckM2.
- qc/trnascan-se/[sampleid]-[assembler]-[binner]*: Bin-level outputs of tRNAscan-SE.
- qc/[sampleid]-[assembler]-[binner].trnascan_summary.tsv: Aggregated summary of tRNAscan-SE results for all bins.
- qc/[sampleid]-[assembler]-[binner].rrna_summary.tsv: Counts of rRNA genes for each bin.

Bin taxonomy

Taxonomic classification of bins with GTDB-Tk and conversion of GTDB taxonomy classifications to NCBI classifications using TaxonKit.

Output files

bins/
- taxonomy/gtdbtk/[sampleid].summary.tsv: GTDB-Tk summary TSV with classifications for each bin.
- taxonomy/gtdbtk/[sampleid]_ncbi.tsv: TSV file containing the GTDB-Tk to NCBI classification translation.
- taxonomy/gtdbtk/[sampleid].classify.tree.gz: Reference tree in Newick format containing query genomes placed with pplacer.
- taxonomy/gtdbtk/[sampleid].markers_summary.tsv: A summary of unique, duplicated, and missing markers within the 120 bacterial marker set, or the 53 archaeal marker set for each submitted genome.
- taxonomy/gtdbtk/[sampleid].*msa.fasta.gz: FASTA files containing MSA of submitted and reference genomes.
- taxonomy/gtdbtk/[sampleid].filtered.tsv: A list of genomes with an insufficient number of amino acids in MSA.
- taxonomy/gtdbtk/[sampleid].failed_genomes.tsv: TSV of genomes which failed classification by GTDB-Tk.
- taxonomy/gtdbtk/[sampleid].log: The console output of GTDB-Tk saved to disk.
- taxonomy/gtdbtk/[sampleid].warnings.log: The verbose output of any GTDB-Tk warnings which were encountered.
- taxonomy/[sampleid].ncbi.tsv: TSV file containing the GTDB-Tk to NCBI classification translation, with associated NCBI taxids.
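
GTDB classifications are conventionally written as semicolon-separated lineage strings with rank prefixes, for example `d__Bacteria;p__...;s__`. The sketch below splits such a string into named ranks, which can help when comparing GTDB and NCBI assignments. The helper `parse_gtdb_lineage` is hypothetical, and the sketch makes no assumption about which column of the output files holds the lineage string.

```python
# Rank prefixes used in GTDB lineage strings.
GTDB_RANKS = {
    "d": "domain",
    "p": "phylum",
    "c": "class",
    "o": "order",
    "f": "family",
    "g": "genus",
    "s": "species",
}


def parse_gtdb_lineage(classification):
    """Split a GTDB lineage string into {rank_name: taxon}, skipping empty ranks."""
    lineage = {}
    for field in classification.split(";"):
        prefix, sep, taxon = field.strip().partition("__")
        # Ignore fields without a known rank prefix or with no taxon assigned.
        if sep and prefix in GTDB_RANKS and taxon:
            lineage[GTDB_RANKS[prefix]] = taxon
    return lineage
```

A bin classified only to genus level ends in an empty `s__` field, so the returned dictionary has no `species` key.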

Bin summary

Summarising key information into a final table, scoring and classification of bins into quality categories according to completeness, contamination, tRNA and rRNA content.

Output files

bins/
- [sampleid].bin_summary.tsv: Bin level summary with statistics, completeness/contamination checks, ncRNA content, and taxonomic classifications.
- [sampleid].group_summary.tsv: Aggregated summary for each assembly:binner combination showing the counts of bins in each quality category.
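
To give a feel for how bins might be sorted into quality categories from completeness, contamination, rRNA and tRNA content, here is a minimal sketch. The thresholds are an assumption taken from the MIMAG standard (high quality: over 90% completeness, under 5% contamination, 5S, 16S and 23S rRNA genes present, and at least 18 tRNAs; medium quality: at least 50% completeness and under 10% contamination). The pipeline's own scoring rules and category names may differ, and the function `quality_category` is hypothetical.

```python
def quality_category(completeness, contamination, has_all_rrnas, trna_count):
    """Assign a MIMAG-style category; thresholds are assumed, not the pipeline's."""
    if completeness > 90 and contamination < 5 and has_all_rrnas and trna_count >= 18:
        return "high"
    if completeness >= 50 and contamination < 10:
        return "medium"
    # Simplification: everything else is grouped together as "low".
    return "low"
```

For instance, a bin that is 95% complete with 2% contamination but lacks a full set of rRNA genes falls into the "medium" category under these assumed rules.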

Pipeline information

Output files

pipeline_info/
- Reports generated by Nextflow: execution_report.html, execution_timeline.html, execution_trace.txt and pipeline_dag.dot/pipeline_dag.svg.
- Reports generated by the pipeline: pipeline_report.html, pipeline_report.txt and software_versions.yml. The pipeline_report* files will only be present if the --email / --email_on_fail parameters are used when running the pipeline.
- Reformatted samplesheet files used as input to the pipeline: samplesheet.valid.csv.
- Parameters used by the pipeline run: params.json.

Nextflow can generate several reports on the running and execution of the pipeline. These reports help you troubleshoot errors and also provide other information such as launch commands, run times and resource usage.
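
As a starting point for troubleshooting, the sketch below counts tasks by status in the execution trace. It assumes the trace file is tab-separated with a header row that includes a `status` column, as in Nextflow's default trace layout; a pipeline configured with different trace fields would need adjusting. The function name `task_status_counts` is a hypothetical helper.

```python
import csv
from collections import Counter


def task_status_counts(trace_path):
    """Count tasks per status value in a tab-separated Nextflow trace file."""
    with open(trace_path, newline="") as handle:
        reader = csv.DictReader(handle, delimiter="\t")
        # Assumes a "status" column is present in the header.
        return Counter(row["status"] for row in reader)
```

A non-zero count for a status such as FAILED points to tasks worth inspecting in the rest of the trace.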
