sanger-tol/
ascc

A Nextflow DSL2 pipeline for the identification of cobiont and contaminating sequences using fasta and pacbio data.

You are viewing the development version pages for this pipeline. The latest stable release is v0.3.0

Introduction

This document describes the output produced by the pipeline.

The directories listed below will be created in the results directory after the pipeline has finished. All paths are relative to the top-level results directory.

Pipeline overview

The pipeline is built using Nextflow and processes data using the following steps:

Processes that produce intermediate outputs:

YamlInput
Generate samplesheet
Validate TaxID
Generate Genome
Filter Fasta
GC Content
Get kmers profile
Extract Tiara Hits
Run FCS-GX
Run nt Kraken
nr Diamond BLASTX
Uniprot Diamond BLASTX

Main outputs

Trailing Ns Check

Output files

trailingns/ *_trim_Ns - A text file containing a report of trailing Ns found in the genome.

Trailing Ns are runs of N at the start or end of a nucleotide sequence, in place of A, G, C or T nucleotides. It is advisable to trim the trailing Ns from sequences in the assembly. If the sequence remaining after trimming is shorter than 200 bp, the script recommends removing it from the assembly.
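As an illustration of the check described here, the following is a minimal sketch, not the pipeline's own script. The function name and the action labels "KEEP", "TRIM" and "REMOVE" are hypothetical; only the 200 bp threshold comes from the text.

```python
MIN_LENGTH_AFTER_TRIM = 200  # threshold stated in the text


def check_trailing_ns(name, sequence, min_length=MIN_LENGTH_AFTER_TRIM):
    """Count leading/trailing Ns in one sequence and suggest an action.

    The action labels are hypothetical, chosen for this sketch only.
    """
    leading = len(sequence) - len(sequence.lstrip("Nn"))
    # Strip the leading run first so an all-N sequence is not counted twice.
    rest = sequence[leading:]
    trailing = len(rest) - len(rest.rstrip("Nn"))
    remaining = len(sequence) - leading - trailing

    if leading == 0 and trailing == 0:
        action = "KEEP"
    elif remaining < min_length:
        action = "REMOVE"
    else:
        action = "TRIM"
    return {
        "name": name,
        "leading_ns": leading,
        "trailing_ns": trailing,
        "length_after_trim": remaining,
        "action": action,
    }
```

A sequence with no Ns at either end is left alone in this sketch regardless of its length, since the text only ties the 200 bp rule to sequences that need trimming.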

Mito Organellar Blast

Output files

mito_organellar_blast/ *-mitochondrial_genome.contamination_recommendation - A file that contains the names of sequences that are suspected mitochondrial contaminants in the nuclear DNA assembly, tagged as either "REMOVE" or "Investigate" depending on the BLAST hit alignment length and percentage identity. The file is empty if there are no suspected mitochondrial contaminants.

This subworkflow uses BLAST against a user-provided mitochondrial sequence to detect leftover organellar sequences in the assembly file that should contain only chromosomal DNA sequences. A BLAST nucleotide database is made from the user-provided organellar sequence. BLAST with the chromosomal DNA assembly is then run against this database with the following settings: -task megablast -word_size 28 -best_hit_overhang 0.1 -best_hit_score_edge 0.1 -dust yes -evalue 0.0001 -perc_identity 80 -soft_masking true. The BLAST results are filtered to keep only hits with an alignment length of at least 200 bp. Depending on the alignment length and percentage identity, the script can recommend an action for dealing with the putative organellar sequence: either "REMOVE" or "Investigate".
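The length filter on the BLAST results could look roughly like the sketch below. It assumes the hits are in BLAST tabular output (outfmt 6) with the default column order, where the alignment length is the fourth column; the text does not state the output format, so treat this as an assumption. The function name is hypothetical, and the thresholds that separate "REMOVE" from "Investigate" are not given in this document, so they are not modelled.

```python
MIN_ALIGNMENT_LENGTH = 200  # threshold stated in the text

# Assumption: default outfmt 6 column order, alignment length at index 3.
LENGTH_COLUMN = 3


def filter_hits_by_length(lines, min_length=MIN_ALIGNMENT_LENGTH):
    """Keep tabular BLAST hits whose alignment length is at least min_length."""
    kept = []
    for line in lines:
        line = line.rstrip("\n")
        if not line or line.startswith("#"):
            continue  # skip blank lines and comment lines
        fields = line.split("\t")
        if int(fields[LENGTH_COLUMN]) >= min_length:
            kept.append(fields)
    return kept
```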

Plastid Organellar Blast

plastid_organellar_blast/ *-plastid_genome.contamination_recommendation - A file that contains the names of sequences that are suspected plastid contaminants in the nuclear DNA assembly, tagged as either "REMOVE" or "Investigate" depending on the BLAST hit alignment length and percentage identity. The file is empty if there are no suspected plastid contaminants.

This subworkflow uses BLAST against a user-provided plastid sequence to detect leftover organellar sequences in the assembly file that should contain only chromosomal sequences. The method is the same as in the Mito Organellar Blast part.

Run FCS-adaptor

Output files

fcs_adaptor/
- *.fcs_adaptor_report.txt - A text file containing potential adaptor sequences and locations.
- *.cleaned_sequences.fa.gz - Cleaned FASTA file.
- *.fcs_adaptor.log - Log of the FCS-adaptor run.
- *.pipeline_args.yaml - Arguments to FCS-adaptor.
- *.skipped_trims.jsonl - Skipped sequences.

FCS-adaptor (https://github.com/ncbi/fcs) is NCBI software for detecting adapter contamination in genome assemblies. FCS-adaptor uses a built-in database of adapter sequences, provided by NCBI. The FCS-adaptor report shows identified potential locations of retained adapter sequences from the sequencing run.

Pacbio Barcode Check

Output files

filter_barcode/ *_filtered.txt - Text file log of PacBio barcode sequences found in the genome. The file is empty if no contamination was found.

Uses BlastN to identify retained PacBio multiplexing barcode contamination in the assembly. The PacBio multiplexing barcode sequences are stored as the pacbio_adaptors.fa file in the assets directory of this pipeline.

Run Read Coverage

Output files

sorted_mapped_bam/ *.bam - BAM file with aligned reads.
average_coverage/ *_average_coverage.txt - Text file containing the coverage information for the genome

This step maps the read data to the input genome with minimap2 (https://github.com/lh3/minimap2) and calculates the average coverage per sequence. The reads used for mapping can be PacBio HiFi reads or paired-end Illumina reads.

Run VecScreen

Output files

summarise_vecscreen_output/ *.vecscreen_contamination - A text file containing potential vector contaminant locations. The file is empty if no potential contaminants were found.

VecScreen (https://www.ncbi.nlm.nih.gov/tools/vecscreen/) is a tool for detecting adapter and vector contamination in genome assemblies. It is an older tool than FCS-adaptor. Its advantage over FCS-adaptor is that it can use a custom database of contaminant sequences made by the user, whereas FCS-adaptor comes with its built-in database.

Create BTK Dataset

Output files

create_btk_dataset/
- btk_datasets/ - A BlobToolKit (https://blobtoolkit.genomehubs.org) dataset folder containing data compatible with BTK viewer (https://blobtoolkit.genomehubs.org/blobtools2/blobtools2-tutorials/opening-a-dataset-in-the-viewer/).
- btk_summary_table_full.tsv - A TSV file summarising the contents of the BlobToolKit dataset. This file is created using the blobtools filter --table command of BlobToolKit.

Creates a BlobToolKit dataset folder compatible with BlobToolKit viewer (https://blobtoolkit.genomehubs.org/blobtools2/blobtools2-tutorials/opening-a-dataset-in-the-viewer/). The BlobToolKit dataset created by ASCC can contain many more variables than the dataset that the BlobToolKit pipeline (https://github.com/sanger-tol/blobtoolkit) produces.

Autofilter and Check Assembly

Output files

autofilter/
- autofiltered.fasta - The decontaminated input genome. The decontamination is based on the results of FCS-GX.
- ABNORMAL_CHECK.csv - Combined FCS-GX and Tiara summary of contamination.
- assembly_filtering_removed_sequences.txt - Sequences deemed contamination by FCS-GX (labelled with the EXCLUDE tag by FCS-GX) and removed from the above assembly.
- fcs-gx_alarm_indicator_file.txt - Contains text to control the running of the BlobToolKit pipeline. If enough contamination is found by FCS-GX, an alarm is triggered to switch on the running of the BlobToolKit pipeline.

Autofilter and check assembly returns a decontaminated genome file as well as summaries of the contamination found.

Sanger-TOL BTK

Output files

sanger_tol_btk/
- *_btk_out/blobtoolkit/${meta.id}*/ - The BlobToolKit dataset folder generated by the sanger-tol/blobtoolkit pipeline.
- *_btk_out/blobtoolkit/plots/ - BlobToolKit plots as PNG images, exported from the BlobToolKit dataset using blobtk (https://pypi.org/project/blobtk/).
- *_btk_out/blobtoolkit/${meta.id}*/summary.json.gz - The summary.json.gz file of the BlobToolKit dataset. It contains assembly metrics.
- *_btk_out/busco/* - The BUSCO results returned by BlobToolKit.
- *_btk_out/multiqc/* - The MultiQC results returned by BlobToolKit.
- blobtoolkit_pipeline_info - The pipeline_info folder.

Sanger-Tol/BlobToolKit (https://github.com/sanger-tol/blobtoolkit) is a Nextflow re-implementation of the Snakemake based BlobToolKit pipeline (https://github.com/blobtoolkit/pipeline) and produces interactive plots used to identify contamination or cobionts and separate these sequences from the main assembly.

Merge BTK Datasets

Output files

merged_tables/
- merged_datasets - A BTK dataset.
- merged_datasets/btk_busco_summary_table_full.tsv - A TSV file containing a summary of the BTK BUSCO results.

This module merges the Create BTK Dataset folder with the Sanger-TOL BTK dataset to create one unified dataset for use with BTK viewer.

ASCC Merge Tables

Output files

ascc_main_output/
- *_contamination_check_merged_table.csv - A CSV table that contains the results of most parts of the pipeline (GC content, coverage, Tiara, Kraken, kmers dimensionality reduction, Diamond, BLAST, FCS-GX, BlobToolKit pipeline) for each sequence in the input assembly file. If a set of prerequisite steps has been run (nt BLAST, nr Diamond, Uniprot Diamond, read mapping for coverage calculation, Tiara, nt Kraken and the creation of a BlobToolKit dataset), the pipeline tries to put together a phylum-level combined classification of the input sequences. It first uses BlobToolKit's bestsum_phylum, then fills the gaps (caused by no-hit sequences) with results from Tiara, and fills any remaining gaps with results from nt Kraken. The combined classification is in the merged_classif column. The merged_classif_source column records which tool's output the classification of each sequence is based on. The automated classification usually has some flaws but is still useful as a starting point for determining the phyla that the input sequences belong to.
- *_phylum_counts_and_coverage.csv - A CSV report containing information on the hits per phylum and the average coverage per phylum. This file can only be generated if the merged_classif variable has been produced in the *_contamination_check_merged_table.csv table, as described above.

Merge Tables merges the summary reports from a number of modules to create a single set of reports.
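The gap-filling order behind the merged_classif and merged_classif_source columns (BlobToolKit's bestsum_phylum first, then Tiara, then nt Kraken) can be sketched as below. This is an illustration only: the function name, the source labels and the set of values treated as "no classification" are assumptions, not taken from the pipeline.

```python
# Assumption: placeholder values that count as "no classification".
NO_HIT_VALUES = {None, "", "no-hit", "unknown", "NA"}


def merge_classification(btk_bestsum_phylum, tiara_label, nt_kraken_phylum):
    """Return (classification, source) using the first tool that has a result.

    Priority follows the text: BlobToolKit, then Tiara, then nt Kraken.
    The source labels are hypothetical.
    """
    candidates = (
        ("btk_bestsum_phylum", btk_bestsum_phylum),
        ("tiara", tiara_label),
        ("nt_kraken", nt_kraken_phylum),
    )
    for source, value in candidates:
        if value not in NO_HIT_VALUES:
            return value, source
    return None, None  # no tool produced a classification
```

Returning the source alongside the value mirrors the pairing of the two columns, so each classification can be traced back to the tool it came from.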

Pipeline Information

Output files

pipeline_info/
- Reports generated by Nextflow: execution_report.html, execution_timeline.html, execution_trace.txt and pipeline_dag.dot/pipeline_dag.svg.
- Reports generated by the pipeline: pipeline_report.html, pipeline_report.txt and software_versions.yml. The pipeline_report* files will only be present if the --email / --email_on_fail parameters are used when running the pipeline.
- Reformatted samplesheet files used as input to the pipeline: samplesheet.valid.csv.
- Parameters used by the pipeline run: params.json.

Nextflow provides excellent functionality for generating various reports relevant to the running and execution of the pipeline. This will allow you to troubleshoot errors with the running of the pipeline, and also provide you with other information such as launch commands, run times and resource usage.

Intermediate outputs

These files are produced by the pipeline's modules, but they stay in Nextflow's work directory and are not included on their own in the final output.

Filter FASTA

Output files

filtered_fasta/ *filtered.fasta - A FASTA file that has been filtered to keep sequences below a given threshold of length.

By default, scaffolds above 1.9 Gb are removed from the assembly, as scaffolds of this size are unlikely to truly be contamination. Scaffolds larger than this also use a significant amount of resources, which hinders production environments. Furthermore, FCS-GX does not work with sequences larger than 2 Gb.
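The length filter can be illustrated with the short sketch below. It is not the pipeline's implementation: the function names are hypothetical, 1.9 Gb is taken to mean 1,900,000,000 bp, and sequences exactly at the threshold are kept, which is an assumption about the boundary case.

```python
DEFAULT_MAX_LENGTH = 1_900_000_000  # assumption: 1.9 Gb expressed in bp


def read_fasta(lines):
    """Yield (header, sequence) pairs from an iterable of FASTA lines."""
    name, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if name is not None:
                yield name, "".join(chunks)
            name, chunks = line[1:], []
        else:
            chunks.append(line)
    if name is not None:
        yield name, "".join(chunks)


def filter_by_length(records, max_length=DEFAULT_MAX_LENGTH):
    """Yield only the records whose sequence is not longer than max_length."""
    for name, sequence in records:
        if len(sequence) <= max_length:
            yield name, sequence
```

Both functions are generators, so very large assemblies are processed one record at a time rather than held in memory together.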

GC Content

Output files

gc_content/ *-GC_CONTENT.txt - A tab separated table describing the GC content of the input genome. The first column contains the sequence names and the second column contains the GC content of each sequence. The GC content is expressed as a fraction: number of G and C nucleotides in the sequence divided by the number of all nucleotides in the sequence.

Calculating the GC content of each sequence in the input genome.
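The GC fraction described above can be written out as a small sketch. The function names and the six-decimal output formatting are assumptions, and so is the choice to count every character of the sequence (including N) in the denominator, following the "number of all nucleotides" wording.

```python
def gc_fraction(sequence):
    """Return (G + C) / sequence length, or 0.0 for an empty sequence."""
    if not sequence:
        return 0.0
    seq = sequence.upper()
    return (seq.count("G") + seq.count("C")) / len(seq)


def gc_table(records):
    """Format (name, sequence) pairs as a two-column tab-separated table."""
    return "".join(
        f"{name}\t{gc_fraction(sequence):.6f}\n" for name, sequence in records
    )
```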

Generate Genome

Output files

generate_genome/ *.genome - An index-like file describing the input genome.

An index-like file containing the scaffold and scaffold length of the input genome.
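A file of this kind can be produced with a few lines, as in the sketch below. The tab-separated "name, length" layout is an assumption based on common .genome files; the text only says the file contains the scaffold and its length. The function name is hypothetical.

```python
def genome_file_lines(records):
    """Return one 'name<TAB>length' line per (name, sequence) pair."""
    return [f"{name}\t{len(sequence)}" for name, sequence in records]
```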

Get kmers Profile

Output files

kmer_data/
- *_KMER_COUNTS.csv - A CSV file containing the counts of kmers (by default: 7mers) in each sequence in the assembly.
- KMERS_dim_reduction_embeddings_combined.csv - A CSV file with the results of dimensionality reduction of kmer counts. The dimensionality reduction embeddings help to separate sequences in the assembly by their origin (sequences originating from the same species likely appear close together in an embedding). When setting up a run, the user can choose multiple methods for dimensionality reduction.

A CSV file containing the counts of kmers (by default: 7mers) in each sequence in the assembly. Also, a file with the results of dimensionality reduction of kmer counts. The following dimensionality reduction methods are available: PCA (principal component analysis), kernel PCA, PCA with SVD (singular value decomposition) solver, UMAP (uniform manifold approximation and projection), t-SNE (t-distributed stochastic neighbor embedding), LLE (locally linear embedding), MDS (multidimensional scaling), SE (spectral embedding), random trees, autoencoder and NMF (non-negative matrix factorisation). The first two dimensions of the dimensionality reduction embeddings are used as the x and y coordinate when visualising the results in BlobToolKit.
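Counting kmers in a single sequence can be sketched as follows. This is a minimal illustration rather than the pipeline's code: the function name is hypothetical, kmers containing characters other than A, C, G or T are skipped by assumption, and no canonical (strand-independent) form is applied, since the text does not say whether the pipeline does so.

```python
from collections import Counter

VALID_BASES = frozenset("ACGT")


def count_kmers(sequence, k=7):
    """Count overlapping kmers of length k in one sequence.

    Assumption: kmers with any non-ACGT character (e.g. N) are skipped.
    """
    seq = sequence.upper()
    counts = Counter()
    for i in range(len(seq) - k + 1):
        kmer = seq[i:i + k]
        if VALID_BASES.issuperset(kmer):
            counts[kmer] += 1
    return counts
```

One such counter per sequence gives the rows of a count matrix, which is the kind of input that the dimensionality reduction methods listed above operate on.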

Extract Tiara Hits

Output files

tiara_raw_output/
- TIARA.txt - A text file containing classifications of the input DNA sequences. Each sequence gets assigned one label out of these: archaea, bacteria, prokarya, eukarya, organelle and unknown.
- log_*.{txt} - A log of the Tiara run.

Tiara (https://github.com/ibe-uw/tiara) uses a neural network to classify DNA sequences.

Run FCS-GX

Output files

fcsgx_data/
- *out/*.fcs_gx_report.txt - A text file containing potential contaminant locations.
- out/*.taxonomy.rpt - Taxonomy report of the potential contaminants.

FCS-GX (https://github.com/ncbi/fcs) is NCBI software that detects contaminants in genome assemblies using a cross species aligner. It uses its own database, provided by NCBI.

Run nt Kraken

Output files

kraken2_data/
- *.kraken2.classifiedreads.txt - A text file containing classifications for each input DNA sequence, generated by Kraken2.
- *.kraken2.report.txt - Summary of the Kraken2 run, generated by Kraken2.
- _nt_kraken_lineage_file.txt - Kraken2 lineages for each input DNA sequence, reformatted as a CSV table to make it possible to merge this information into a table that contains sequence classifications from other tools, e.g. BLAST and Diamond.

Kraken (https://github.com/DerrickWood/kraken2) assigns taxonomic labels to input DNA sequences based on comparing them to a database of kmers. ASCC uses a Kraken database made from the sequences of the NCBI nt database. The FASTA sequences of NCBI nt database are available at https://ftp.ncbi.nlm.nih.gov/blast/db/FASTA/.

nr Diamond BLASTX

Output files

nr_diamond/
- *.txt - A tabular text file containing the raw output of running Diamond BLASTX with sampled chunks of the assembly. The file contains BLASTX hits and scores. Format: outfmt 6 qseqid sseqid pident length mismatch gapopen qstart qend sstart send evalue bitscore staxids sscinames sskingdoms sphylums salltitles
- full_coords.tsv - A tabular text file containing the results from Diamond BLASTX, where the hit coordinates within chunks of the assembly have been converted to coordinates in the full sequences of the assembly.
- *_diamond_blastx_top_hits.csv - A file containing Diamond BLASTX top hits for each sequence in the input assembly file.
- *_diamond_outfmt6.tsv - The full_coords.tsv file reformatted to make it compatible with BlobToolKit, so that the hits in it can be added to a BlobToolKit dataset. Format: outfmt 6 qseqid staxids bitscore qseqid sseqid pident length mismatch gapopen qstart qend sstart send evalue bitscore

Diamond (https://github.com/bbuchfink/diamond) is a sequence aligner for protein sequences and translated nucleotide sequences. Here it is used to identify contamination using the NCBI nr database. The FASTA sequences of NCBI nr database are available at https://ftp.ncbi.nlm.nih.gov/blast/db/FASTA/.
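The column reordering behind the *_diamond_outfmt6.tsv file can be sketched as below, using the two column lists given in the output file descriptions. The sketch assumes that full_coords.tsv keeps the same column order as the raw Diamond output, which the text does not confirm, and the function name is hypothetical.

```python
# Column order of the raw Diamond output, as listed in the text.
RAW_COLUMNS = (
    "qseqid sseqid pident length mismatch gapopen qstart qend sstart send "
    "evalue bitscore staxids sscinames sskingdoms sphylums salltitles"
).split()

# Column order of the BlobToolKit-compatible file, as listed in the text.
BTK_COLUMNS = (
    "qseqid staxids bitscore qseqid sseqid pident length mismatch gapopen "
    "qstart qend sstart send evalue bitscore"
).split()


def to_btk_row(line):
    """Reorder one tab-separated hit line into the BlobToolKit column order.

    Assumption: the input line uses the RAW_COLUMNS order.
    """
    fields = line.rstrip("\n").split("\t")
    if len(fields) != len(RAW_COLUMNS):
        raise ValueError(
            f"expected {len(RAW_COLUMNS)} columns, found {len(fields)}"
        )
    row = dict(zip(RAW_COLUMNS, fields))
    return "\t".join(row[column] for column in BTK_COLUMNS)
```

Looking columns up by name rather than by index keeps the mapping readable and lets the repeated qseqid and bitscore columns be expressed directly in the target list.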

Uniprot Diamond BLASTX

Output files

up_diamond/
- *.txt - A tabular text file containing the raw output of running Diamond BLASTX with sampled chunks of the assembly. The file contains BLASTX hits and scores. Format: outfmt 6 qseqid sseqid pident length mismatch gapopen qstart qend sstart send evalue bitscore staxids sscinames sskingdoms sphylums salltitles
- full_coords.tsv - A tabular text file containing the results from Diamond BLASTX, where the hit coordinates within chunks of the assembly have been converted to coordinates in the full sequences of the assembly.
- *_diamond_blastx_top_hits.csv - A file containing Diamond BLASTX top hits for each sequence in the input assembly file.
- *_diamond_outfmt6.tsv - The full_coords.tsv file reformatted to make it compatible with BlobToolKit, so that the hits in it can be added to a BlobToolKit dataset. Format: outfmt 6 qseqid staxids bitscore qseqid sseqid pident length mismatch gapopen qstart qend sstart send evalue bitscore

Diamond (https://github.com/bbuchfink/diamond) is a sequence aligner for protein sequences and translated nucleotide sequences. Here it is used to identify contamination using the Uniprot database.

Generate Samplesheet

Output files

generate_samplesheet/ *.csv - A CSV file containing data locations, for use in BlobToolkit.

This produces a CSV containing information on the read data for use in BlobToolKit.
