Introduction

This document describes the output produced by the pipeline.

The directories comply with Tree of Life's canonical directory structure.

Pipeline overview

The pipeline is built using Nextflow and produces the outputs described in the sections below.

Outputs are deposited in the main output directory (--outdir) and under the per-genome output directory given in the samplesheet. Both paths follow the usual rules for joining directories: a relative path is resolved against the current working directory, while an absolute path is used as-is.
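
The same convention is what Python's pathlib implements with its "/" operator, so the resolution rule can be sketched as follows (the directory names here are illustrative):

```python
from pathlib import PurePosixPath

def resolve_outdir(cwd: str, outdir: str) -> PurePosixPath:
    # Joining with "/" appends a relative outdir to the working
    # directory but keeps an absolute outdir unchanged.
    return PurePosixPath(cwd) / outdir

print(resolve_outdir("/work/run1", "results"))        # /work/run1/results
print(resolve_outdir("/work/run1", "/data/results"))  # /data/results
```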

Sequence composition files

The following files can be expected in the output directory:

base_content
├── k1
│   ├── <genome_name>.A.1k.bedGraph.gz
│   ├── <genome_name>.A.1k.bedGraph.gz.csi
│   ├── <genome_name>.A.1k.bedGraph.gz.tbi
│   ├── <genome_name>.AT_skew.1k.bedGraph.gz
│   ├── <genome_name>.AT_skew.1k.bedGraph.gz.csi
│   ├── <genome_name>.AT_skew.1k.bedGraph.gz.tbi
│   ├── <genome_name>.C.1k.bedGraph.gz
│   ├── <genome_name>.C.1k.bedGraph.gz.csi
│   ├── <genome_name>.C.1k.bedGraph.gz.tbi
│   ├── <genome_name>.G.1k.bedGraph.gz
│   ├── <genome_name>.G.1k.bedGraph.gz.csi
│   ├── <genome_name>.G.1k.bedGraph.gz.tbi
│   ├── <genome_name>.GC.1k.bedGraph.gz
│   ├── <genome_name>.GC.1k.bedGraph.gz.csi
│   ├── <genome_name>.GC.1k.bedGraph.gz.tbi
│   ├── <genome_name>.GC_skew.1k.bedGraph.gz
│   ├── <genome_name>.GC_skew.1k.bedGraph.gz.csi
│   ├── <genome_name>.GC_skew.1k.bedGraph.gz.tbi
│   ├── <genome_name>.mononuc.1k.tsv.gz
│   ├── <genome_name>.mononuc.1k.tsv.gz.csi
│   ├── <genome_name>.mononuc.1k.tsv.gz.tbi
│   ├── <genome_name>.N.1k.bedGraph.gz
│   ├── <genome_name>.N.1k.bedGraph.gz.csi
│   ├── <genome_name>.N.1k.bedGraph.gz.tbi
│   ├── <genome_name>.nucShannon.1k.bedGraph.gz
│   ├── <genome_name>.nucShannon.1k.bedGraph.gz.csi
│   ├── <genome_name>.nucShannon.1k.bedGraph.gz.tbi
│   ├── <genome_name>.T.1k.bedGraph.gz
│   ├── <genome_name>.T.1k.bedGraph.gz.csi
│   └── <genome_name>.T.1k.bedGraph.gz.tbi
├── k2
│   ├── <genome_name>.CpG.1k.bedGraph.gz
│   ├── <genome_name>.CpG.1k.bedGraph.gz.csi
│   ├── <genome_name>.CpG.1k.bedGraph.gz.tbi
│   ├── <genome_name>.dinuc.1k.tsv.gz
│   ├── <genome_name>.dinuc.1k.tsv.gz.csi
│   ├── <genome_name>.dinuc.1k.tsv.gz.tbi
│   ├── <genome_name>.dinucShannon.1k.bedGraph.gz
│   ├── <genome_name>.dinucShannon.1k.bedGraph.gz.csi
│   └── <genome_name>.dinucShannon.1k.bedGraph.gz.tbi
├── k3
│   ├── <genome_name>.trinuc.1k.tsv.gz
│   ├── <genome_name>.trinuc.1k.tsv.gz.csi
│   ├── <genome_name>.trinuc.1k.tsv.gz.tbi
│   ├── <genome_name>.trinucShannon.1k.bedGraph.gz
│   ├── <genome_name>.trinucShannon.1k.bedGraph.gz.csi
│   └── <genome_name>.trinucShannon.1k.bedGraph.gz.tbi
└── k4
    ├── <genome_name>.tetranuc.1k.tsv.gz
    ├── <genome_name>.tetranuc.1k.tsv.gz.csi
    ├── <genome_name>.tetranuc.1k.tsv.gz.tbi
    ├── <genome_name>.tetranucShannon.1k.bedGraph.gz
    ├── <genome_name>.tetranucShannon.1k.bedGraph.gz.csi
    └── <genome_name>.tetranucShannon.1k.bedGraph.gz.tbi

where <genome_name> is the name of the input genome file.

All outputs are bedGraph or TSV (BED3+) files, compressed with bgzip and indexed with tabix (both .csi and .tbi indices are provided).
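
Because bgzip output is valid gzip, these files can also be read with standard tools. A minimal sketch of parsing one of the bedGraph files (the file name and values below are illustrative, not real pipeline output):

```python
import gzip

def read_bedgraph(path):
    """Yield (chrom, start, end, value) records from a
    bgzip/gzip-compressed bedGraph file."""
    with gzip.open(path, "rt") as fh:
        for line in fh:
            chrom, start, end, value = line.rstrip("\n").split("\t")
            yield chrom, int(start), int(end), float(value)

# Illustrative: write a tiny example file, then parse it back.
with gzip.open("example.GC.1k.bedGraph.gz", "wt") as fh:
    fh.write("chr1\t0\t1000\t0.41\n")
    fh.write("chr1\t1000\t2000\t0.38\n")

records = list(read_bedgraph("example.GC.1k.bedGraph.gz"))
print(records[0])  # ('chr1', 0, 1000, 0.41)
```

For random access to a region (the purpose of the .csi/.tbi indices), the tabix command-line tool is the usual choice.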

For each k from 1 to 4, the k-mer counts are in k${k}/<genome_name>.*nuc.1k.tsv.gz, and the resulting Shannon diversity metrics in k${k}/<genome_name>.*nucShannon.1k.bedGraph.gz.
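
The pipeline's exact implementation is not shown here, but Shannon diversity over the k-mer counts of a window is conventionally computed as H = -Σ pᵢ ln pᵢ. A minimal sketch under that assumption:

```python
import math
from collections import Counter

def shannon_diversity(window: str, k: int) -> float:
    """Shannon entropy (natural log) of the k-mer spectrum of one window."""
    counts = Counter(window[i:i + k] for i in range(len(window) - k + 1))
    total = sum(counts.values())
    return -sum((n / total) * math.log(n / total) for n in counts.values())

# Uniform mononucleotide composition gives the maximum, ln(4) ≈ 1.3863.
print(round(shannon_diversity("ACGTACGT", k=1), 4))
```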

Additionally, the following per-window frequencies are extracted into their own bedGraph files:

  • each nucleotide, and N
  • GC content, GC skew, and AT skew
  • CpG
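
Window size and edge handling are pipeline-specific, but the metrics themselves have conventional definitions, sketched here for a single window:

```python
def composition_metrics(window: str) -> dict:
    """GC content, GC skew and AT skew of one window, using the
    conventional definitions:
      GC content = (G + C) / length
      GC skew    = (G - C) / (G + C)
      AT skew    = (A - T) / (A + T)
    """
    a, c, g, t = (window.count(base) for base in "ACGT")
    return {
        "GC": (g + c) / len(window),
        "GC_skew": (g - c) / (g + c) if g + c else 0.0,
        "AT_skew": (a - t) / (a + t) if a + t else 0.0,
    }

print(composition_metrics("GGCA"))  # GC = 0.75, GC_skew = 1/3, AT_skew = 1.0
```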

Pipeline information

Output files
  • pipeline_info/
    • Reports generated by Nextflow: execution_report.html, execution_timeline.html, execution_trace.txt and pipeline_dag.dot/pipeline_dag.svg.
    • Reports generated by the pipeline: pipeline_report.html, pipeline_report.txt and software_versions.yml. The pipeline_report* files are only present if the --email / --email_on_fail parameters are used when running the pipeline.
    • Reformatted samplesheet files used as input to the pipeline: samplesheet.valid.csv.
    • Parameters used by the pipeline run: params.json.

Nextflow provides excellent functionality for generating reports about the execution of the pipeline. These help you troubleshoot errors and record information such as launch commands, run times and resource usage.