Introduction

This document describes the output produced by the pipeline. Most of the plots are taken from the MultiQC report, which summarises results at the end of the pipeline.

The directories listed below will be created in the results directory after the pipeline has finished. All paths are relative to the top-level results directory.

Pipeline overview

The pipeline is built using Nextflow and processes data using the following steps:

YamlInput

Output files
  • NA

YamlInput parses the input yaml into channels for later use in the pipeline.

Validate TaxID

Output files
  • NA

Validate TaxID scans through the taxdump to ensure that the input taxid is present in the NCBI taxdump.

Filter Fasta

Output files
  • filter/ *filtered.fasta - A FASTA file filtered to remove sequences above a given length threshold.

By default, scaffolds longer than 1.9 Gb are removed from the assembly, as scaffolds of this size are unlikely to truly contain contamination. Scaffolds larger than this also consume a significant amount of resources, which hinders production environments.
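The length filter can be sketched as follows. This is a minimal illustration, not the pipeline's actual module: the FASTA parsing, record naming, and output handling here are simplified, and the 1.9 Gb default is only shown as a parameter.

```python
# Illustrative sketch of filtering a FASTA by scaffold length.
# The real pipeline module may parse and write records differently.

def read_fasta(text):
    """Parse FASTA text into a dict of {scaffold_name: sequence}."""
    records, name = {}, None
    for line in text.splitlines():
        if line.startswith(">"):
            name = line[1:].split()[0]
            records[name] = []
        elif name is not None:
            records[name].append(line.strip())
    return {n: "".join(parts) for n, parts in records.items()}

def filter_fasta(records, max_len=1_900_000_000):
    """Keep only scaffolds at or below max_len bases (default 1.9 Gb)."""
    return {n: s for n, s in records.items() if len(s) <= max_len}

fasta = ">scaf1\nACGT\n>scaf2\nACGTACGTAC\n"
# A tiny cutoff is used here so the example actually filters something.
kept = filter_fasta(read_fasta(fasta), max_len=8)
```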

GC Content

Output files
  • gc/ *-GC_CONTENT.txt - A text file describing the GC content of the input genome.

Calculates the GC content of the input genome.
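The underlying calculation is simple: the fraction of G and C bases among unambiguous bases. A minimal sketch (the pipeline's exact handling of ambiguity codes and report formatting may differ):

```python
def gc_content(seq):
    """Fraction of G/C bases among unambiguous A/C/G/T bases.

    Ambiguity codes such as N are excluded from the denominator here;
    this is an assumption for illustration, not the pipeline's rule.
    """
    seq = seq.upper()
    acgt = sum(seq.count(b) for b in "ACGT")
    gc = seq.count("G") + seq.count("C")
    return gc / acgt if acgt else 0.0
```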

Generate Genome

Output files
  • generate/ *.genome - An index-like file describing the input genome.

An index-like file listing each scaffold of the input genome together with its length.
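The `.genome` format is essentially tab-separated scaffold name and length pairs, so generating it can be sketched as:

```python
def genome_index(records):
    """Return '.genome'-style lines: scaffold name <TAB> length.

    `records` is assumed to be {scaffold_name: sequence}; the real
    module is likely built on a FASTA index rather than in-memory sequences.
    """
    return ["%s\t%d" % (name, len(seq)) for name, seq in records.items()]

lines = genome_index({"scaf1": "ACGT", "scaf2": "ACGTAC"})
```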

Trailing Ns Check

Output files
  • trailingns/ *_trim_Ns - A text file containing a report of the Ns found in the genome.

A text file containing a report of the Ns found in the genome.
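Detecting runs of Ns in a sequence can be sketched with a regular expression. This is illustrative only; the report format and the minimum run length the pipeline uses are not specified here.

```python
import re

def n_runs(seq, min_len=1):
    """Return (start, end) half-open, 0-based coordinates of runs of N.

    min_len is a hypothetical knob for ignoring short runs; the
    pipeline's actual threshold (if any) may differ.
    """
    return [(m.start(), m.end())
            for m in re.finditer(r"[Nn]+", seq)
            if m.end() - m.start() >= min_len]

runs = n_runs("NNACGTNNNN")
```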

Get KMERS profile

Output files
  • get/ *_KMER_COUNTS.csv - A csv file containing kmers and their counts.

A csv file containing kmers and their counts.
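Counting k-mers means tallying every overlapping window of length k. A minimal sketch of the counting step (the pipeline's choice of k, canonicalisation, and CSV layout are not specified here):

```python
from collections import Counter

def kmer_counts(seq, k):
    """Count all overlapping k-mers in a sequence."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))

counts = kmer_counts("ACGCGT", 2)
# Writing `counts` out as "kmer,count" rows would give the CSV profile.
```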

Extract Tiara Hits

Output files
  • tiara/
    • *.{txt,txt.gz} - A text file containing classifications of potential contaminants.
    • log_*.{txt,txt.gz} - A log of the Tiara run.
    • *.{fasta,fasta.gz} - An output FASTA file.

Tiara is a deep-learning-based classifier that assigns assembly sequences to broad taxonomic categories (for example prokaryotic, eukaryotic, and organellar); the hits it flags as potential contaminants are extracted here.

Mito Organellar Blast

Output files
  • blast/ *.tsv - A tsv file containing potential contaminants.

A BlastN based subworkflow used on the input genome to filter potential contaminants from the genome.

Chloro Organellar Blast

Output files
  • blast/ *.tsv - A tsv file containing potential contaminants.

A BlastN based subworkflow used on the input genome to filter potential contaminants from the genome.

Run FCS Adaptor

Output files
  • fcs/
    • *.fcs_adaptor_report.txt - A text file containing potential adaptor sequences and locations.
    • *.cleaned_sequences.fa.gz - Cleaned FASTA file.
    • *.fcs_adaptor.log - Log of the FCS run.
    • *.pipeline_args.yaml - Arguments to FCS Adaptor.
    • *.skipped_trims.jsonl - Skipped sequences.

FCS Adaptor identifies potential locations of retained adaptor sequences from the sequencing run.

Run FCS-GX

Output files
  • fcs/
    • *out/*.fcs_gx_report.txt - A text file containing potential contaminant locations.
    • out/*.taxonomy.rpt - Taxonomy report of the potential contaminants.

FCS-GX identifies potential locations of contaminant sequences.

Pacbio Barcode Check

Output files
  • filter/ *_filtered.txt - Text file of barcodes found in the genome.

Uses BlastN to identify where given barcode sequences may be in the genome.

Run Read Coverage

Output files
  • samtools/
    • *.bam - Aligned BAM file.
    • *_average_coverage.txt - Text file containing the coverage information for the genome.

Maps the read data to the input genome and calculates the average coverage across it.
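Given per-base depths (for example, the three-column `chrom<TAB>pos<TAB>depth` output of `samtools depth`), the average coverage is just the mean depth. A minimal sketch of that final step; the actual module's alignment and coverage tooling is not shown here:

```python
def average_coverage(depth_lines):
    """Mean per-base depth from 'chrom<TAB>pos<TAB>depth' lines,
    i.e. the format produced by `samtools depth`."""
    depths = [int(line.split("\t")[2]) for line in depth_lines if line.strip()]
    return sum(depths) / len(depths) if depths else 0.0

avg = average_coverage(["chr1\t1\t10", "chr1\t2\t20", "chr1\t3\t30"])
```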

Run Vecscreen

Output files
  • summarise/ *.vecscreen_contamination - A text file containing potential vector contaminant locations.

Vecscreen identifies vector contamination in the input sequence.

Run NT Kraken

Output files
  • kraken2/
    • *.classified{.,_}* - FASTQ file containing classified sequences.
    • *.unclassified{.,_}* - FASTQ file containing unclassified sequences.
    • *classifiedreads.txt - A text file containing a report on reads which have been classified.
    • *report.txt - Report of the Kraken2 run.
  • get/
    • *txt - Text file containing lineage information of the reported metagenomic data.

Kraken2 assigns taxonomic labels to metagenomic DNA sequences and can optionally output the FASTQ of these data.

Nucleotide Diamond Blast

Output files
  • diamond/ *.txt - A text file containing the genomic locations of hits and scores.
  • reformat/ *text - A reformatted text file containing the full genomic location of hits and scores.
  • convert/ *.hits - A file containing all hits above the cutoff.

DIAMOND BLAST is a sequence aligner for translated and protein sequences; here it is used to identify contamination using the NCBI database.

Uniprot Diamond Blast

Output files
  • diamond/ *.txt - A text file containing the genomic locations of hits and scores.
  • reformat/ *text - A reformatted text file containing the full genomic location of hits and scores.
  • convert/ *.hits - A file containing all hits above the cutoff.

DIAMOND BLAST is a sequence aligner for translated and protein sequences; here it is used to identify contamination using the UniProt database.

Create BTK dataset

Output files
  • create/
    • btk_datasets/ - A BTK dataset folder containing data compatible with BTK viewer.
    • btk_summary_table_full.tsv - A TSV file summarising the dataset.

Create BTK dataset creates a BTK_dataset folder compatible with BTK viewer.

Autofilter and check assembly

Output files
  • autofilter/
    • autofiltered.fasta - The decontaminated input genome.
    • ABNORMAL_CHECK.csv - Combined FCS and Tiara summary of contamination.
    • assembly_filtering_removed_sequences.txt - Sequences deemed contamination and removed from the above assembly.
    • fcs-gx_alarm_indicator_file.txt - Contains text to control the running of Blobtoolkit.

Autofilter and check assembly returns a decontaminated genome file as well as summaries of the contamination found.
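The filtering step can be sketched as a split of the assembly on a set of flagged scaffold names. This is a simplified illustration; the real module derives the flagged set from the FCS and Tiara results and may trim regions rather than whole scaffolds.

```python
def autofilter(records, flagged):
    """Split {name: sequence} records into (kept, removed_names)
    using a set of scaffold names flagged as contamination."""
    kept = {n: s for n, s in records.items() if n not in flagged}
    removed = sorted(set(records) & set(flagged))
    return kept, removed

kept, removed = autofilter({"scaf1": "ACGT", "scaf2": "GGCC"}, {"scaf2"})
# `kept` corresponds to autofiltered.fasta, `removed` to
# assembly_filtering_removed_sequences.txt.
```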

Generate samplesheet

Output files
  • generate/ *.csv - A CSV file containing data locations, for use in Blobtoolkit.

This produces a CSV containing information on the read data for use in BlobToolKit.
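Building such a samplesheet is plain CSV writing. The column names below are illustrative only, not the pipeline's exact schema:

```python
import csv
import io

def write_samplesheet(rows, fieldnames):
    """Write read-data locations as CSV text for downstream use.

    `fieldnames` and the example columns are hypothetical; the real
    samplesheet's schema is defined by the pipeline.
    """
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=fieldnames)
    writer.writeheader()
    writer.writerows(rows)
    return buf.getvalue()

sheet = write_samplesheet(
    [{"sample": "s1", "datatype": "pacbio", "datafile": "/reads/s1.fastq.gz"}],
    ["sample", "datatype", "datafile"],
)
```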

Sanger-TOL BTK

Output files
  • sanger/
    • *_btk_out/blobtoolkit/${meta.id}*/ - The BTK dataset folder generated by BTK.
    • *_btk_out/blobtoolkit/plots/ - The plots for display in BTK Viewer.
    • *_btk_out/blobtoolkit/${meta.id}*/summary.json.gz - The Summary.json file...
    • *_btk_out/busco/* - The BUSCO results returned by BTK.
    • *_btk_out/multiqc/* - The MultiQC results returned by BTK.
    • blobtoolkit_pipeline_info - The pipeline_info folder.

Sanger-Tol/BlobToolKit is a Nextflow re-implementation of the Snakemake-based BlobToolKit pipeline. It produces interactive plots used to identify true contamination and separate sequences from the main assembly.

Merge BTK datasets

Output files
  • merge/
    • merged_datasets - A BTK dataset.
    • merged_datasets/btk_busco_summary_table_full.tsv - A TSV file containing a summary of the BTK BUSCO results.

This module merges the Create BTK dataset folder with the Sanger-Tol BTK dataset to create one unified dataset for use with BTK viewer.

ASCC Merge Tables

Output files
  • ascc/
    • *_contamination_check_merged_table.csv - ....
    • *_contamination_check_merged_table_extended.csv - ....
    • *_phylum_counts_and_coverage.csv - A CSV report containing information on the hits per phylum and the coverage of the hits.

Merge Tables merges the summary reports from a number of modules in order to create a single set of reports.
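Conceptually this is an outer join of per-module rows on the scaffold name: each module contributes columns, and rows with the same scaffold are combined. A minimal sketch with hypothetical column names:

```python
def merge_tables(tables):
    """Outer-join per-module report rows on scaffold name.

    Each table is {scaffold: {column: value}}; later tables add or
    overwrite columns. Column names below are illustrative only.
    """
    merged = {}
    for table in tables:
        for scaffold, fields in table.items():
            merged.setdefault(scaffold, {}).update(fields)
    return merged

merged = merge_tables([
    {"scaf1": {"gc": 0.41}},
    {"scaf1": {"coverage": 33.0}, "scaf2": {"coverage": 12.0}},
])
```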

Pipeline information

Output files
  • pipeline_info/
    • Reports generated by Nextflow: execution_report.html, execution_timeline.html, execution_trace.txt and pipeline_dag.dot/pipeline_dag.svg.
    • Reports generated by the pipeline: pipeline_report.html, pipeline_report.txt and software_versions.yml. The pipeline_report* files will only be present if the --email / --email_on_fail parameters are used when running the pipeline.
    • Reformatted samplesheet files used as input to the pipeline: samplesheet.valid.csv.

Nextflow provides excellent functionality for generating various reports relevant to the running and execution of the pipeline. This will allow you to troubleshoot errors with the running of the pipeline, and also provide you with other information such as launch commands, run times and resource usage.