
sanger-tol/variantcalling

Nextflow DSL2 pipeline to call variants on long-read alignments.

https://github.com/sanger-tol/variantcalling

Introduction

This document describes the output produced by the pipeline.

The directories listed below will be created in the results directory after the pipeline has finished. All paths are relative to the top-level results directory.

Pipeline overview

The pipeline is built using Nextflow and processes data using the following steps:

Read Alignments - Optional aligned CRAM files generated by minimap2
Alignments Statistics - Optional statistics files generated by samtools
PacBio Variant Calling - VCF and GVCF compressed files generated by DeepVariant
Pipeline information - Report metrics generated during the workflow execution

Read Alignments

The unaligned PacBio read data is filtered and aligned using minimap2. Reads from the same specimen (across multiple runs) are merged into CRAM files.

Output files

read_mapping/
- pacbio/
  - / (can be nested directories if `sample` contains `/`)
    - Aligned CRAM files: `.pacbio..minimap2.cram` (any `/` in the sample name is replaced with `.`).
    - Aligned CRAM index files: `.pacbio..minimap2.cram.crai` (any `/` in the sample name is replaced with `.`).
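Because sample directories can be nested, a recursive search is the simplest way to check that every aligned CRAM file has its index. The following is a minimal sketch, not part of the pipeline: the function name `crams_missing_index` is hypothetical, and it relies only on the `read_mapping/pacbio` layout and the `.minimap2.cram` / `.crai` suffixes listed above.

```python
from pathlib import Path


def crams_missing_index(results_dir):
    """Return aligned CRAM files under read_mapping/pacbio that lack a .crai index.

    Hypothetical helper for illustration; it is not shipped with the pipeline.
    """
    base = Path(results_dir) / "read_mapping" / "pacbio"
    missing = []
    # rglob descends into nested sample directories (sample names containing "/").
    for cram in sorted(base.rglob("*.minimap2.cram")):
        index = cram.with_name(cram.name + ".crai")
        if not index.exists():
            missing.append(cram)
    return missing
```

Calling `crams_missing_index("results")` on a finished run should return an empty list if every CRAM file was indexed.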

Alignments Statistics

Statistics for the aligned CRAM files are calculated using samtools.

Output files

read_mapping/
- pacbio/
  - / (can be nested directories if `sample` contains `/`)
    - stats/
      - Comprehensive statistics from alignment file: `.pacbio..minimap2.stats`.
      - Number of alignments for each FLAG type: `.pacbio..minimap2.flagstat`.
      - Alignment summary statistics: `.pacbio..minimap2.idxstats`.
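The `.idxstats` file is the easiest of the three to consume programmatically. The sketch below assumes the standard `samtools idxstats` layout of four tab-separated columns (reference name, sequence length, mapped read-segments, unmapped read-segments); the names `IdxstatsRow`, `parse_idxstats` and `total_mapped` are hypothetical and only serve the illustration.

```python
from typing import NamedTuple


class IdxstatsRow(NamedTuple):
    reference: str
    length: int
    mapped: int
    unmapped: int


def parse_idxstats(text):
    """Parse the text of an idxstats file into a list of IdxstatsRow.

    Assumes four tab-separated columns per line, as written by samtools idxstats.
    """
    rows = []
    for line in text.splitlines():
        if not line.strip():
            continue
        name, length, mapped, unmapped = line.split("\t")
        rows.append(IdxstatsRow(name, int(length), int(mapped), int(unmapped)))
    return rows


def total_mapped(rows):
    """Sum the mapped read-segments over all reference sequences."""
    return sum(row.mapped for row in rows)
```

The final `*` line that samtools writes for reads without a reference is parsed like any other row; it contributes only to the `unmapped` column.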

PacBio Variant Calling

The aligned PacBio read data is used to call variants with DeepVariant. To speed up calling, the genome FASTA file is split into smaller pieces. bcftools is then used to combine the split VCF and GVCF files generated by DeepVariant.
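To make the splitting idea concrete, here is one way a FASTA file could be divided into chunks of bounded size. This is an illustrative sketch only: the pipeline's actual splitting strategy is not described here, and the functions `read_fasta` and `split_records` as well as the `max_bases` parameter are assumptions introduced for the example.

```python
def read_fasta(text):
    """Parse FASTA text into a list of (header, sequence) tuples."""
    records = []
    header, parts = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                records.append((header, "".join(parts)))
            header, parts = line[1:], []
        else:
            parts.append(line)
    if header is not None:
        records.append((header, "".join(parts)))
    return records


def split_records(records, max_bases):
    """Greedily group records into chunks holding at most max_bases bases.

    A single record longer than max_bases is placed in a chunk of its own.
    """
    chunks, current, size = [], [], 0
    for header, seq in records:
        if current and size + len(seq) > max_bases:
            chunks.append(current)
            current, size = [], 0
        current.append((header, seq))
        size += len(seq)
    if current:
        chunks.append(current)
    return chunks
```

Each chunk could then be handed to a separate variant-calling job, with the per-chunk results concatenated afterwards in the original sequence order.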

Output files

variant_analysis/
- pacbio/
  - / (can be nested directories if `sample` contains `/`)
    - Compressed VCF files: `.pacbio..minimap2.deepvariant.vcf.gz` (any `/` in the sample name is replaced with `.`; when alignment is skipped, the output VCF basename is derived from the input basename).
    - Index of compressed VCF files: `.pacbio..minimap2.deepvariant.vcf.gz.[tbi|csi]`.
    - Compressed GVCF files: `.pacbio..minimap2.deepvariant.g.vcf.gz`.
    - Index of compressed GVCF files: `.pacbio..minimap2.deepvariant.g.vcf.gz.[tbi|csi]`.
    - `qc`
      - HTML files: `.pacbio..minimap2.deepvariant.[vcf|g.vcf].stats.visual_report.html`.
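For a quick sanity check of a compressed VCF without any bioinformatics libraries, the records can be counted with the Python standard library, since bgzip-compressed files are readable as ordinary gzip streams. The helper name `count_vcf_records` is hypothetical; this is a rough sketch rather than a replacement for the statistics the pipeline already reports.

```python
import gzip


def count_vcf_records(path):
    """Count data lines (everything except '#' header lines) in a compressed VCF."""
    count = 0
    with gzip.open(path, "rt") as handle:
        for line in handle:
            if line.strip() and not line.startswith("#"):
                count += 1
    return count
```

Note that for a GVCF file this count also includes reference blocks, so it will usually be much larger than the number of variant sites.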

Pipeline information

Output files

pipeline_info/
- Reports generated by Nextflow: execution_report.html, execution_timeline.html, execution_trace.txt and pipeline_dag.dot/pipeline_dag.svg.
- Reports generated by the pipeline: pipeline_report.html, pipeline_report.txt and software_versions.yml. The pipeline_report* files will only be present if the --email / --email_on_fail parameters are used when running the pipeline.
- Reformatted samplesheet files used as input to the pipeline: samplesheet.valid.csv.
- Parameters used by the pipeline run: params.json.

Nextflow provides excellent functionality for generating various reports relevant to the running and execution of the pipeline. This will allow you to troubleshoot errors with the running of the pipeline, and also provide you with other information such as launch commands, run times and resource usage.
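As an example of using these reports for troubleshooting, the sketch below lists tasks from `execution_trace.txt` that did not finish successfully. It assumes the trace file is tab-separated with a header row that includes `name` and `status` columns, which is the case for Nextflow's default trace fields but can differ if the trace configuration has been customised. The function name `unfinished_tasks` is hypothetical.

```python
import csv
import io


def unfinished_tasks(trace_text):
    """Return (name, status) pairs for tasks that are neither COMPLETED nor CACHED.

    Assumes a tab-separated trace with 'name' and 'status' columns.
    """
    reader = csv.DictReader(io.StringIO(trace_text), delimiter="\t")
    return [
        (row["name"], row["status"])
        for row in reader
        if row["status"] not in ("COMPLETED", "CACHED")
    ]
```

The text can be read with `open("pipeline_info/execution_trace.txt").read()`; adjust the path if the trace file in your run carries a different name.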