Introduction

This document describes the output produced by the pipeline. Most of the plots are taken from the MultiQC report, which summarises results at the end of the pipeline.

The directories listed below will be created in the results directory after the pipeline has finished. All paths are relative to the top-level results directory.
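
As a quick sanity check after a run, the expected layout can be verified with a short script. This is a minimal sketch, not part of the pipeline: the patterns are copied from the output sections of this document, and the function name `missing_outputs` is hypothetical.

```python
from pathlib import Path

# Glob patterns relative to the results directory, copied from this document.
EXPECTED_OUTPUTS = [
    "gfastats",
    "merquryfk",
    "sanger/*_blobtoolkit_out",
    "sanger/*_curationpretext_out",
    "pipeline_info",
]


def missing_outputs(results_dir):
    """Return the expected patterns that match no directory under results_dir."""
    root = Path(results_dir)
    return [
        pattern
        for pattern in EXPECTED_OUTPUTS
        if not any(path.is_dir() for path in root.glob(pattern))
    ]
```

Calling `missing_outputs("results")` returns an empty list when every directory described here is present, and otherwise lists the patterns that matched nothing.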

Pipeline overview

The pipeline is built using Nextflow and processes data using the following steps:

  • GFASTATS - Collect statistics on the curated primary assembly
  • MERQURYFK - Generate k-mer plots for the curated assembly using previous run information
  • SANGER_TOL_BTK - Run BlobToolKit to generate plots and short_summary.txt from BUSCO
  • SANGER_TOL_CPRETEXT - Run Curationpretext to generate Pretext files and accessory tracks
  • Pipeline information - Report metrics generated during the workflow execution

GFASTATS

Output files
  • gfastats/
    • *.assembly.summary: Assembly metrics for the input primary assembly.
    • *_fasta.gz: Gzipped primary assembly file.

GFASTATS is a single, fast and exhaustive tool for computing summary statistics and manipulating genome assembly files (fasta, fastq, gfa, optionally gzipped).
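
The summary file can be loaded into a dictionary for downstream checks. The sketch below assumes the `*.assembly.summary` file is plain text with one `key: value` pair per line; that layout is an assumption about how the pipeline invokes the tool, so inspect a real file first. The function name `parse_assembly_summary` is hypothetical.

```python
def parse_assembly_summary(text):
    """Parse 'key: value' lines into a dict of strings.

    Assumes one metric per line, separated by a colon and a space.
    Lines without that separator, or with an empty value, are skipped.
    """
    metrics = {}
    for line in text.splitlines():
        key, sep, value = line.partition(": ")
        if not sep or not value.strip():
            continue
        metrics[key.strip()] = value.strip()
    return metrics
```

Values are kept as strings because the metrics mix integers, floats and ratios; convert individual entries as needed, for example `int(metrics["Scaffold N50"])` if that key is present in your file.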

MERQURYFK

Output files
  • merquryfk/
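    • *.completeness.stats: K-mer completeness statistics for the assembly.
    • *{"primary","haplotype",""}_only.bed: Positions of k-mers found only in the assembly and not in the reads.
    • *{"primary","haplotype",""}.qv: Consensus quality value (QV) estimates.
    • *.spectra-asm.{fl,ln,st}.png: Assembly spectrum plots in filled, line and stacked styles.
    • *{"primary","haplotype"}.spectra-cn.{fl,ln,st}.png: Copy-number spectrum plots in filled, line and stacked styles.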

MERQURYFK is a FastK based version of Merqury.

Merqury is a novel tool for reference-free assembly evaluation based on efficient k-mer set operations. By comparing k-mers in a de novo assembly to those found in unassembled high-accuracy reads, Merqury estimates base-level accuracy and completeness.
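
To make the accuracy and completeness estimates concrete, the sketch below reproduces the k-mer based calculations as described for Merqury: the fraction of assembly k-mers also seen in the reads is converted to a per-base error rate and then to a Phred-scaled QV. The function and argument names are hypothetical, and MERQURYFK computes these values itself; this is only an illustration of the arithmetic.

```python
import math


def kmer_qv(asm_only_kmers, total_asm_kmers, k):
    """Phred-scaled consensus QV from k-mer counts.

    asm_only_kmers: assembly k-mers never seen in the reads
    total_asm_kmers: all k-mers in the assembly
    k: k-mer length
    """
    if total_asm_kmers <= 0 or not 0 <= asm_only_kmers <= total_asm_kmers:
        raise ValueError("counts must satisfy 0 <= asm_only <= total and total > 0")
    if asm_only_kmers == 0:
        return math.inf  # no unsupported k-mers, so no errors were detected
    # Probability that a single base is correct, given that a k-mer is
    # supported only if all k of its bases are correct.
    p_base = (1 - asm_only_kmers / total_asm_kmers) ** (1 / k)
    return -10 * math.log10(1 - p_base)


def kmer_completeness(found_read_kmers, total_read_kmers):
    """Percentage of reliable read k-mers that are present in the assembly."""
    if total_read_kmers <= 0:
        raise ValueError("total_read_kmers must be positive")
    return 100 * found_read_kmers / total_read_kmers
```

For example, with k = 21, one unsupported k-mer per thousand assembly k-mers corresponds to a QV of roughly 43.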

SANGER_TOL_BTK

Output files
  • sanger/*_blobtoolkit_out/
    • blobtoolkit/plots/*png: BlobToolKit plots.
    • blobtoolkit/{ASSEMBLY_NAME}/*.json.gz: BlobToolKit dataset for use in BTK_viewer.
    • busco/*_odb10/*.{tsv,tar.gz,json,txt}: BUSCO output.
    • multiqc/: MultiQC plots/data and report.html.
    • pipeline_info

SANGER_TOL_BTK is a bioinformatics pipeline that can be used to identify and analyse non-target DNA for eukaryotic genomes.

SANGER_TOL_CPRETEXT

Output files
  • sanger/*_curationpretext_out/
    • accessory_files/*.{bigWig,bed,bedgraph}: Track files describing telomere, gap and coverage data across the genome.
    • pretext_maps_raw: Pretext files before accessory file ingestion.
    • pretext_maps_processed: Pretext files after accessory file ingestion, i.e. the final output.
    • pipeline_info

SANGER_TOL_CPRETEXT is a bioinformatics pipeline, typically used in conjunction with TreeVal, that generates pretext maps for the manual curation of high-quality genomes. It can optionally produce telomeric, gap, coverage and repeat density plots that can be ingested into pretext.

Pipeline information

Output files
  • pipeline_info/
    • Reports generated by Nextflow: execution_report.html, execution_timeline.html, execution_trace.txt and pipeline_dag.dot/pipeline_dag.svg.
    • Reports generated by the pipeline: pipeline_report.html, pipeline_report.txt and software_versions.yml. The pipeline_report* files will only be present if the --email / --email_on_fail parameters are used when running the pipeline.
    • Reformatted samplesheet files used as input to the pipeline: samplesheet.valid.csv.
    • Parameters used by the pipeline run: params.json.

Nextflow can generate various reports on the running and execution of the pipeline. These reports allow you to troubleshoot pipeline errors and provide other information such as launch commands, run times and resource usage.
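
When troubleshooting, the trace file is often the quickest place to find which tasks went wrong. The sketch below assumes the trace is the tab-separated file written by Nextflow with its default `name`, `status` and `exit` columns; the function name `failed_tasks` is hypothetical, and the file name should be adjusted to whatever is present in `pipeline_info/`.

```python
import csv

# Status values that indicate a task did not finish successfully.
FAILURE_STATUSES = {"FAILED", "ABORTED"}


def failed_tasks(trace_path):
    """Return (task name, exit code) pairs for failed or aborted tasks."""
    with open(trace_path, newline="") as handle:
        reader = csv.DictReader(handle, delimiter="\t")
        return [
            (row.get("name", ""), row.get("exit", ""))
            for row in reader
            if row.get("status") in FAILURE_STATUSES
        ]
```

For example, `failed_tasks("results/pipeline_info/execution_trace.txt")` returns an empty list for a clean run. Cached and completed tasks are ignored, so the result stays meaningful for resumed runs.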