Introduction
This document describes the output produced by the pipeline. Most of the plots are taken from the MultiQC report, which summarises results at the end of the pipeline.
The directories listed below will be created in the results directory after the pipeline has finished. All paths are relative to the top-level results directory.
Pipeline overview
The pipeline is built using Nextflow and processes data using the following steps:
- YamlInput - Parse the input yaml into channels
- Validate TaxID - Check that the input TaxID is present in the NCBI taxdump
- Filter Fasta - Remove sequences above a given length threshold
- GC Content - Calculate the GC content of the input genome
- Generate Genome - Generate an index-like file of scaffold names and lengths
- Trailing Ns Check - Report the Ns found in the genome
- Get KMERS profile - Count the kmers in the input genome
- Extract Tiara Hits - Classify sequences to flag potential contaminants
- Mito organellar blast - BlastN screen for mitochondrial contaminants
- Plastid organellar blast - BlastN screen for plastid contaminants
- Run FCS Adaptor - Identify retained adaptor sequences
- Run FCS-GX - Identify potential contaminant sequences
- Pacbio Barcode Check - BlastN screen for retained PacBio barcodes
- Run Read Coverage - Map the read data and calculate average coverage
- Run Vecscreen - Identify vector contamination
- Run NT Kraken - Assign taxonomic labels with Kraken2
- Nucleotide Diamond Blast - DIAMOND BLAST search against the NCBI database
- Uniprot Diamond Blast - DIAMOND BLAST search against the Uniprot database
- Create BTK dataset - Create a dataset folder compatible with BlobToolKit Viewer
- Autofilter and check assembly - Produce a decontaminated genome and contamination summaries
- Generate samplesheet - Produce a CSV of read data locations for BlobToolKit
- Sanger-TOL BTK - Run the Sanger-ToL/BlobToolKit pipeline
- Merge BTK datasets - Merge the BTK datasets into one unified dataset
- ASCC Merge Tables - Merge module summary reports into a single set of reports
- Pipeline information - Report metrics generated during the workflow execution
YamlInput
Output files
NA
YamlInput parses the input yaml into channels for later use in the pipeline.
Validate TaxID
Output files
NA
Validate TaxID scans through the NCBI taxdump to ensure that the input TaxID is present in it.
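The check described above can be sketched in Python. This is an illustrative sketch, not the pipeline's own code; it only assumes the documented NCBI taxdump layout, where `nodes.dmp` columns are separated by `\t|\t` and column 0 is the tax_id:

```python
def taxid_in_taxdump(taxid, nodes_dmp_lines):
    """Return True if `taxid` appears as a node in the taxdump."""
    for line in nodes_dmp_lines:
        # nodes.dmp columns are separated by "\t|\t"; column 0 is the tax_id
        fields = line.rstrip("\t|\n").split("\t|\t")
        if fields and fields[0] == str(taxid):
            return True
    return False

# Two example nodes.dmp lines (root and Homo sapiens)
lines = ["1\t|\t1\t|\tno rank\t|\n", "9606\t|\t9605\t|\tspecies\t|\n"]
```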
Filter Fasta
Output files
filter/
*filtered.fasta
- A fasta file from which sequences above a given length threshold have been removed.
By default, scaffolds above 1.9 Gb are removed from the assembly, as scaffolds of this size are unlikely to truly contain contamination. Scaffolds larger than this also consume a significant amount of resources, which hinders production environments.
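The length filter can be sketched as follows; this is assumed logic for illustration, not the pipeline's implementation:

```python
def filter_fasta(records, max_len=1_900_000_000):
    """Keep only sequences at or below `max_len` bases.

    records: dict mapping sequence name -> sequence string.
    The 1.9 Gb default mirrors the threshold described above.
    """
    return {name: seq for name, seq in records.items() if len(seq) <= max_len}

# Small threshold used here purely to demonstrate the behaviour
records = {"scaf_1": "ACGT" * 10, "scaf_2": "A" * 100}
kept = filter_fasta(records, max_len=50)
```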
GC Content
Output files
gc/
*-GC_CONTENT.txt
- A text file describing the GC content of the input genome.
Calculates the GC content of the input genome.
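A minimal sketch of a GC-content calculation, assuming the standard definition (fraction of G and C bases over all unambiguous bases); the exact formula used by the pipeline is not specified here:

```python
def gc_content(seq):
    """Fraction of G/C bases among unambiguous A/C/G/T bases."""
    seq = seq.upper()
    gc = seq.count("G") + seq.count("C")
    at = seq.count("A") + seq.count("T")
    # Avoid division by zero for sequences with no unambiguous bases
    return gc / (gc + at) if (gc + at) else 0.0
```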
Generate Genome
Output files
generate/
*.genome
- An index-like file describing the input genome.
An index-like file containing the scaffold names and scaffold lengths of the input genome.
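Building such an index can be sketched as below; the tab-separated "name, length" layout is an assumption based on the description above, not a guarantee of the file's exact format:

```python
def make_genome_index(records):
    """Return an index-like string: one 'name<TAB>length' line per scaffold."""
    return "\n".join(f"{name}\t{len(seq)}" for name, seq in records.items())

idx = make_genome_index({"scaf_1": "ACGTACGT", "scaf_2": "ACG"})
```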
Trailing Ns Check
Output files
trailingns/
*_trim_Ns
- A text file containing a report of the Ns found in the genome.
Checks the genome for trailing Ns and reports the Ns found.
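Detecting runs of Ns at the ends of a sequence can be sketched as follows; this is illustrative of the idea of a trailing-Ns check, not the pipeline's own report logic:

```python
def trailing_ns(seq):
    """Return (leading_n_count, trailing_n_count) for a sequence."""
    s = seq.upper()
    # Length difference before/after stripping gives the run length
    lead = len(s) - len(s.lstrip("N"))
    trail = len(s) - len(s.rstrip("N"))
    return lead, trail
```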
Get KMERS profile
Output files
get/
*_KMER_COUNTS.csv
- A csv file containing kmers and their counts.
Counts the kmers in the input genome and reports them as a CSV of kmers and their counts.
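K-mer counting can be sketched in a few lines; the value of k and the output layout here are illustrative assumptions, not taken from the pipeline:

```python
from collections import Counter

def kmer_counts(seq, k=3):
    """Count all overlapping k-mers in a sequence."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))

counts = kmer_counts("ACGACG", k=3)
```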
Extract Tiara Hits
Output files
tiara/
*.{txt,txt.gz}
- A text file containing classifications of potential contaminants.
log_*.{txt,txt.gz}
- A log of the tiara run.
*.{fasta,fasta.gz}
- An output fasta file.
Tiara classifies each sequence (e.g. as prokaryotic, eukaryotic or organellar), and these classifications are used to flag potential contaminants in the assembly.
Mito Organellar Blast
Output files
blast/
*.tsv
- A tsv file containing potential contaminants.
A BlastN-based subworkflow that screens the input genome for mitochondrial sequence, identifying potential organellar contaminants.
Chloro Organellar Blast
Output files
blast/
*.tsv
- A tsv file containing potential contaminants.
A BlastN-based subworkflow that screens the input genome for plastid sequence, identifying potential organellar contaminants.
Run FCS Adaptor
Output files
fcs/
*.fcs_adaptor_report.txt
- A text file containing potential adaptor sequences and locations.
*.cleaned_sequences.fa.gz
- Cleaned fasta file.
*.fcs_adaptor.log
- Log of the fcs run.
*.pipeline_args.yaml
- Arguments to FCS Adaptor
*.skipped_trims.jsonl
- Skipped sequences
FCS Adaptor identifies potential locations of retained adaptor sequences from the sequencing run.
Run FCS-GX
Output files
fcs/
*out/*.fcs_gx_report.txt
- A text file containing potential contaminant locations.
out/*.taxonomy.rpt
- Taxonomy report of the potential contaminants.
FCS-GX identifies potential locations of contaminant sequences.
Pacbio Barcode Check
Output files
filter/
*_filtered.txt
- Text file of barcodes found in the genome.
Uses BlastN to identify where given PacBio barcode sequences may be present in the genome.
Run Read Coverage
Output files
samtools/
*.bam
- Aligned BAM file.
*_average_coverage.txt
- Text file containing the coverage information for the genome
Maps the read data to the input genome and calculates the average coverage across it.
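The average-coverage calculation can be sketched under a common definition (total aligned bases divided by genome length); this definition is an assumption, not a statement of the pipeline's exact method:

```python
def average_coverage(aligned_read_lengths, genome_length):
    """Average depth: total aligned bases / genome length (assumed definition)."""
    return sum(aligned_read_lengths) / genome_length

# Three reads totalling 500 aligned bases over a 1 kb genome
cov = average_coverage([100, 150, 250], genome_length=1000)
```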
Run Vecscreen
Output files
summarise/
*.vecscreen_contamination
- A text file containing potential vector contaminant locations.
Vecscreen identifies vector contamination in the input sequence.
Run NT Kraken
Output files
kraken2/
*.classified{.,_}*
- Fastq file containing classified sequences.
*.unclassified{.,_}*
- Fastq file containing unclassified sequences.
*classifiedreads.txt
- A text file containing a report on reads which have been classified.
*report.txt
- Report of Kraken2 run.
get/
*txt
- Text file containing lineage information of the reported metagenomic data.
Kraken2 assigns taxonomic labels to metagenomic DNA sequences and optionally outputs the fastq files of these data.
Nucleotide Diamond Blast
Output files
diamond/
*.txt
- A text file containing the genomic locations of hits and scores.
reformat/
*text
- A reformatted text file containing the full genomic location of hits and scores.
convert/
*.hits
- A file containing all hits above the cutoff.
DIAMOND BLAST is a sequence aligner for translated and protein sequences; here it is used to identify contamination using the NCBI database.
Uniprot Diamond Blast
Output files
diamond/
*.txt
- A text file containing the genomic locations of hits and scores.
reformat/
*text
- A reformatted text file containing the full genomic location of hits and scores.
convert/
*.hits
- A file containing all hits above the cutoff.
DIAMOND BLAST is a sequence aligner for translated and protein sequences; here it is used to identify contamination using the Uniprot database.
Create BTK dataset
Output files
create/
btk_datasets/
- A btk dataset folder containing data compatible with BTK viewer.
btk_summary_table_full.tsv
- A TSV file summarising the dataset.
Create BTK dataset creates a BTK dataset folder compatible with BlobToolKit Viewer.
Autofilter and check assembly
Output files
autofilter/
autofiltered.fasta
- The decontaminated input genome.
ABNORMAL_CHECK.csv
- Combined FCS and Tiara summary of contamination.
assembly_filtering_removed_sequences.txt
- Sequences deemed contamination and removed from the above assembly.
fcs-gx_alarm_indicator_file.txt
- Contains text to control the running of Blobtoolkit.
Autofilter and check assembly returns a decontaminated genome file as well as summaries of the contamination found.
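The filtering half of this step can be sketched as dropping scaffolds whose names appear in a removal list derived from the FCS and Tiara summaries; this is assumed logic for illustration, not the pipeline's implementation:

```python
def autofilter(records, removed_names):
    """Drop scaffolds flagged as contamination.

    records: dict of scaffold name -> sequence.
    removed_names: set of names deemed contaminant (e.g. from the
    FCS/Tiara summary; the inputs here are hypothetical).
    """
    return {name: seq for name, seq in records.items() if name not in removed_names}

kept = autofilter({"scaf_1": "ACGT", "scaf_2": "NNNN"}, {"scaf_2"})
```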
Generate samplesheet
Output files
generate/
*.csv
- A CSV file containing data locations, for use in Blobtoolkit.
This produces a CSV containing information on the read data for use in BlobToolKit.
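Writing such a samplesheet can be sketched with the standard library's `csv` module; the column names used here are illustrative assumptions, not the pipeline's exact header:

```python
import csv
import io

def write_samplesheet(rows):
    """Render rows as CSV text with a hypothetical three-column header."""
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=["sample", "datatype", "datafile"])
    writer.writeheader()
    writer.writerows(rows)
    return buf.getvalue()

sheet = write_samplesheet(
    [{"sample": "sampleA", "datatype": "pacbio", "datafile": "reads.fasta.gz"}]
)
```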
Sanger-TOL BTK
Output files
sanger/
*_btk_out/blobtoolkit/${meta.id}*/
- The BTK dataset folder generated by BTK.
*_btk_out/blobtoolkit/plots/
- The plots for display in BTK Viewer.
*_btk_out/blobtoolkit/${meta.id}*/summary.json.gz
- The Summary.json file...
*_btk_out/busco/*
- The BUSCO results returned by BTK.
*_btk_out/multiqc/*
- The MultiQC results returned by BTK.
blobtoolkit_pipeline_info
- The pipeline_info folder.
Sanger-ToL/BlobToolKit is a Nextflow re-implementation of the Snakemake-based BlobToolKit pipeline; it produces interactive plots used to identify true contamination and separate it from the main assembly.
Merge BTK datasets
Output files
merge/
merged_datasets
- A BTK dataset.
merged_datasets/btk_busco_summary_table_full.tsv
- A TSV file containing a summary of the btk busco results.
This module merges the Create BTK dataset folder with the Sanger-ToL BTK dataset to create one unified dataset for use with BlobToolKit Viewer.
ASCC Merge Tables
Output files
ascc/
*_contamination_check_merged_table.csv
- ....
*_contamination_check_merged_table_extended.csv
- ....
*_phylum_counts_and_coverage.csv
- A CSV report containing information on the hits per phylum and the coverage of the hits.
Merge Tables merges the summary reports from a number of modules in order to create a single set of reports.
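Merging per-module reports into one table can be sketched as combining rows keyed on scaffold name; the key and column names here are illustrative, since the real columns come from the individual modules:

```python
def merge_tables(*tables):
    """Merge dict-of-dict tables: {scaffold: {column: value}}."""
    merged = {}
    for table in tables:
        for scaffold, columns in table.items():
            # Later tables add/overwrite columns for the same scaffold
            merged.setdefault(scaffold, {}).update(columns)
    return merged

merged = merge_tables({"scaf_1": {"gc": 0.41}}, {"scaf_1": {"coverage": 30.2}})
```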
Pipeline information
Output files
pipeline_info/
- Reports generated by Nextflow: execution_report.html, execution_timeline.html, execution_trace.txt and pipeline_dag.dot/pipeline_dag.svg.
- Reports generated by the pipeline: pipeline_report.html, pipeline_report.txt and software_versions.yml. The pipeline_report* files will only be present if the --email / --email_on_fail parameters are used when running the pipeline.
- Reformatted samplesheet files used as input to the pipeline: samplesheet.valid.csv.
Nextflow provides excellent functionality for generating various reports relevant to the running and execution of the pipeline. This will allow you to troubleshoot errors with the running of the pipeline, and also provide you with other information such as launch commands, run times and resource usage.