Introduction
This document describes the output produced by the pipeline.
The directories listed below will be created in the results directory after the pipeline has finished. All paths are relative to the top-level results directory.
The directories comply with Tree of Life's canonical directory structure.
Pipeline overview
The pipeline is built using Nextflow and processes data using the following steps:
- Sequence composition files - Files with various statistics about sequence composition
- Pipeline information - Report metrics generated during the workflow execution
Sequence composition files
Here are the files you can expect in the analysis/ sub-directory.
analysis
└── gfLaeSulp1.1
    └── base_content
        ├── k1
        │   ├── GCA_927399515.1.A.1k.bedGraph.gz
        │   ├── GCA_927399515.1.A.1k.bedGraph.gz.csi
        │   ├── GCA_927399515.1.A.1k.bedGraph.gz.tbi
        │   ├── GCA_927399515.1.AT_skew.1k.bedGraph.gz
        │   ├── GCA_927399515.1.AT_skew.1k.bedGraph.gz.csi
        │   ├── GCA_927399515.1.AT_skew.1k.bedGraph.gz.tbi
        │   ├── GCA_927399515.1.C.1k.bedGraph.gz
        │   ├── GCA_927399515.1.C.1k.bedGraph.gz.csi
        │   ├── GCA_927399515.1.C.1k.bedGraph.gz.tbi
        │   ├── GCA_927399515.1.G.1k.bedGraph.gz
        │   ├── GCA_927399515.1.G.1k.bedGraph.gz.csi
        │   ├── GCA_927399515.1.G.1k.bedGraph.gz.tbi
        │   ├── GCA_927399515.1.GC.1k.bedGraph.gz
        │   ├── GCA_927399515.1.GC.1k.bedGraph.gz.csi
        │   ├── GCA_927399515.1.GC.1k.bedGraph.gz.tbi
        │   ├── GCA_927399515.1.GC_skew.1k.bedGraph.gz
        │   ├── GCA_927399515.1.GC_skew.1k.bedGraph.gz.csi
        │   ├── GCA_927399515.1.GC_skew.1k.bedGraph.gz.tbi
        │   ├── GCA_927399515.1.mononuc.1k.tsv.gz
        │   ├── GCA_927399515.1.mononuc.1k.tsv.gz.csi
        │   ├── GCA_927399515.1.mononuc.1k.tsv.gz.tbi
        │   ├── GCA_927399515.1.N.1k.bedGraph.gz
        │   ├── GCA_927399515.1.N.1k.bedGraph.gz.csi
        │   ├── GCA_927399515.1.N.1k.bedGraph.gz.tbi
        │   ├── GCA_927399515.1.nucShannon.1k.bedGraph.gz
        │   ├── GCA_927399515.1.nucShannon.1k.bedGraph.gz.csi
        │   ├── GCA_927399515.1.nucShannon.1k.bedGraph.gz.tbi
        │   ├── GCA_927399515.1.T.1k.bedGraph.gz
        │   ├── GCA_927399515.1.T.1k.bedGraph.gz.csi
        │   └── GCA_927399515.1.T.1k.bedGraph.gz.tbi
        ├── k2
        │   ├── GCA_927399515.1.CpG.1k.bedGraph.gz
        │   ├── GCA_927399515.1.CpG.1k.bedGraph.gz.csi
        │   ├── GCA_927399515.1.CpG.1k.bedGraph.gz.tbi
        │   ├── GCA_927399515.1.dinuc.1k.tsv.gz
        │   ├── GCA_927399515.1.dinuc.1k.tsv.gz.csi
        │   ├── GCA_927399515.1.dinuc.1k.tsv.gz.tbi
        │   ├── GCA_927399515.1.dinucShannon.1k.bedGraph.gz
        │   ├── GCA_927399515.1.dinucShannon.1k.bedGraph.gz.csi
        │   └── GCA_927399515.1.dinucShannon.1k.bedGraph.gz.tbi
        ├── k3
        │   ├── GCA_927399515.1.trinuc.1k.tsv.gz
        │   ├── GCA_927399515.1.trinuc.1k.tsv.gz.csi
        │   ├── GCA_927399515.1.trinuc.1k.tsv.gz.tbi
        │   ├── GCA_927399515.1.trinucShannon.1k.bedGraph.gz
        │   ├── GCA_927399515.1.trinucShannon.1k.bedGraph.gz.csi
        │   └── GCA_927399515.1.trinucShannon.1k.bedGraph.gz.tbi
        └── k4
            ├── GCA_927399515.1.tetranuc.1k.tsv.gz
            ├── GCA_927399515.1.tetranuc.1k.tsv.gz.csi
            ├── GCA_927399515.1.tetranuc.1k.tsv.gz.tbi
            ├── GCA_927399515.1.tetranucShannon.1k.bedGraph.gz
            ├── GCA_927399515.1.tetranucShannon.1k.bedGraph.gz.csi
            └── GCA_927399515.1.tetranucShannon.1k.bedGraph.gz.tbi
These files are the results of the pipeline. Following the convention, the directory structure includes the assembly name, e.g. gfLaeSulp1.1, and all files are named after the assembly accession, e.g. GCA_927399515.1.
All outputs are in bedGraph and TSV (BED3+) formats, compressed with bgzip and indexed with tabix (.csi and .tbi indices).
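Because bgzip output is gzip-compatible, the bedGraph files can be read sequentially with a standard gzip reader. The following is a minimal sketch, assuming the usual four tab-separated bedGraph columns (sequence name, start, end, value); the names `Interval` and `read_bedgraph` are illustrative and not part of the pipeline.

```python
import gzip
from typing import Iterator, NamedTuple


class Interval(NamedTuple):
    """One bedGraph record (hypothetical helper type)."""
    chrom: str
    start: int    # 0-based start, as in BED
    end: int      # end coordinate, exclusive
    value: float


def read_bedgraph(path: str) -> Iterator[Interval]:
    """Yield one Interval per data line of a gzip/bgzip-compressed bedGraph file."""
    with gzip.open(path, "rt") as handle:
        for line in handle:
            # Skip blank lines and optional comment/track lines.
            if not line.strip() or line.startswith(("#", "track", "browser")):
                continue
            chrom, start, end, value = line.rstrip("\n").split("\t")[:4]
            yield Interval(chrom, int(start), int(end), float(value))
```

For example, `list(read_bedgraph("GCA_927399515.1.GC.1k.bedGraph.gz"))` would load every window of the GC track into memory. For region queries on large files, the .csi and .tbi indices can be used with the tabix command-line tool instead of a full scan.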
For each k from 1 to 4, the k-mer counts are in k${k}/GCA_*.*nuc.1k.tsv.gz, and the resulting Shannon diversity metrics in k${k}/GCA_*.*nucShannon.1k.bedGraph.gz.
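To illustrate how a Shannon diversity value relates to k-mer counts, here is a minimal sketch. It assumes the standard Shannon entropy with a base-2 logarithm, computed over overlapping k-mers on one strand. The pipeline's exact choices (log base, normalisation, handling of N) are not stated here, so treat `count_kmers` and `shannon_entropy` as hypothetical helpers rather than the pipeline's implementation.

```python
import math
from collections import Counter
from typing import Mapping


def count_kmers(sequence: str, k: int) -> Counter:
    """Count overlapping k-mers in a sequence (upper-cased, single strand)."""
    seq = sequence.upper()
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


def shannon_entropy(counts: Mapping[str, int], base: float = 2.0) -> float:
    """Shannon entropy H = -sum(p * log(p)) of the k-mer frequency distribution."""
    total = sum(counts.values())
    if total == 0:
        return 0.0
    entropy = 0.0
    for count in counts.values():
        if count > 0:
            p = count / total
            entropy -= p * math.log(p, base)
    return entropy
```

With these definitions, a window in which all four nucleotides are equally frequent has a mononucleotide entropy of 2 bits, and a window made of a single repeated k-mer has an entropy of 0.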
Additionally, these frequencies are extracted in bedGraph files:
- each nucleotide, and N
- GC content, GC skew, and AT skew
- CpG
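The metrics listed above can be sketched for a single window as follows. This assumes the commonly used definitions: GC content as (G+C)/(A+C+G+T), GC skew as (G-C)/(G+C), and AT skew as (A-T)/(A+T). Whether the pipeline includes N in the denominators, or reports CpG as a count or a frequency, is not specified in this document, so the function `composition_metrics` is an illustrative assumption only.

```python
from typing import Dict


def composition_metrics(window: str) -> Dict[str, float]:
    """Compute simple composition metrics for one sequence window."""
    seq = window.upper()
    a, c, g, t = (seq.count(base) for base in "ACGT")
    called = a + c + g + t  # bases that are not N or other ambiguity codes

    def ratio(numerator: float, denominator: float) -> float:
        # Return NaN when the metric is undefined (empty denominator).
        return numerator / denominator if denominator else float("nan")

    return {
        "GC": ratio(g + c, called),
        "GC_skew": ratio(g - c, g + c),
        "AT_skew": ratio(a - t, a + t),
        "N": ratio(seq.count("N"), len(seq)),
        "CpG": float(seq.count("CG")),  # number of CG dinucleotides (assumed: a raw count)
    }
```

For the window `"GGCA"`, this gives a GC content of 0.75, a GC skew of 1/3 and an AT skew of 1.0.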
Pipeline information
pipeline_info/sequencecomposition/
- Reports generated by Nextflow: execution_report.html, execution_timeline.html, execution_trace.txt and pipeline_dag.dot/pipeline_dag.svg.
- Reports generated by the pipeline: pipeline_report.html, pipeline_report.txt and software_versions.yml. The pipeline_report* files will only be present if the --email / --email_on_fail parameters are used when running the pipeline.
- Reformatted samplesheet files used as input to the pipeline: samplesheet.valid.csv.
Nextflow can generate various reports about the running and execution of the pipeline. These reports help you troubleshoot errors and provide other information such as launch commands, run times and resource usage.
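As a troubleshooting aid, the trace file can be scanned for tasks that did not finish successfully. The sketch below assumes that execution_trace.txt is tab-separated with a header row containing a `status` column, as in Nextflow's default trace layout; if the trace fields have been customised, the column names may differ. The function name `unfinished_tasks` is hypothetical.

```python
import csv
from typing import Dict, List


def unfinished_tasks(trace_path: str) -> List[Dict[str, str]]:
    """Return the trace rows whose status is neither COMPLETED nor CACHED."""
    with open(trace_path, newline="") as handle:
        rows = csv.DictReader(handle, delimiter="\t")
        return [
            row for row in rows
            if row.get("status") not in ("COMPLETED", "CACHED")
        ]
```

Calling `unfinished_tasks("pipeline_info/sequencecomposition/execution_trace.txt")` from the results directory would list the failed or aborted tasks, whose `name` and `exit` columns (if present) are a reasonable starting point for debugging.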