
sanger-tol/variantcalling

Nextflow DSL2 pipeline to call variants on long-read alignments.

These pages are for an old version of the pipeline (v1.1.3). The latest stable release is v2.0.2.

https://github.com/sanger-tol/variantcalling

Introduction

This document describes the output produced by the pipeline.

The directories listed below will be created in the results directory after the pipeline has finished. All paths are relative to the top-level results directory.

Pipeline overview

The pipeline is built using Nextflow and processes data using the following steps:

Readmapping Alignments - Optional aligned CRAM files generated by minimap2
Alignments Statistics - Optional statistics files generated by samtools
VCFtools Processing - Heterozygosity and per-site nucleotide diversity calculated by VCFtools
PacBio Variant Calling - Compressed VCF and GVCF files generated by DeepVariant
Pipeline information - Report metrics generated during the workflow execution

Readmapping Alignments

The unaligned PacBio read data is filtered and aligned using minimap2. CRAM files from the same sample are merged.

Output files

readmapping
- Aligned CRAM files: <fasta_name>.pacbio.<sample_name>.cram.
- Aligned CRAM index files: <fasta_name>.pacbio.<sample_name>.cram.crai.
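The naming pattern above can be turned into a quick check that the expected alignment files exist. This is a minimal sketch, assuming the files sit directly inside the readmapping directory; the function names and the example values "asm" and "s1" are hypothetical and not part of the pipeline.

```python
from pathlib import Path


def readmapping_outputs(results_dir, fasta_name, sample_name):
    """Return the expected CRAM path and its index path for one sample."""
    # Pattern taken from the list above: <fasta_name>.pacbio.<sample_name>.cram
    stem = f"{fasta_name}.pacbio.{sample_name}"
    cram = Path(results_dir) / "readmapping" / f"{stem}.cram"
    crai = cram.with_name(cram.name + ".crai")
    return cram, crai


def missing_outputs(results_dir, fasta_name, sample_name):
    """List the expected readmapping files that are not present on disk."""
    expected = readmapping_outputs(results_dir, fasta_name, sample_name)
    return [path for path in expected if not path.exists()]


# Hypothetical fasta and sample names, for illustration only.
cram, crai = readmapping_outputs("results", "asm", "s1")
print(cram.as_posix())
print(crai.as_posix())
```

Because readmapping is optional, an empty or absent directory is not necessarily an error.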

Alignments Statistics

Statistics for the aligned CRAM files are calculated using samtools.

Output files

statistics
- Comprehensive statistics from alignment file: <fasta_name>.pacbio.<sample_name>.stats.
- Number of alignments for each FLAG type: <fasta_name>.pacbio.<sample_name>.flagstats.
- Alignment summary statistics: <fasta_name>.pacbio.<sample_name>.idxstats.
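As an illustration of how the .idxstats file might be consumed, the sketch below parses the tab-separated layout that samtools idxstats writes (reference name, sequence length, mapped read-segments, unmapped read-segments) and computes an overall mapped fraction. The sample text and function names are made up for the example.

```python
import csv
import io


def parse_idxstats(handle):
    """Parse samtools idxstats text into a list of per-reference dicts."""
    rows = []
    for fields in csv.reader(handle, delimiter="\t"):
        if len(fields) != 4:
            continue  # skip blank or malformed lines
        name, length, mapped, unmapped = fields
        rows.append({
            "name": name,
            "length": int(length),
            "mapped": int(mapped),
            "unmapped": int(unmapped),
        })
    return rows


def mapped_fraction(rows):
    """Fraction of read-segments counted as mapped across all rows."""
    mapped = sum(row["mapped"] for row in rows)
    total = mapped + sum(row["unmapped"] for row in rows)
    return mapped / total if total else 0.0


# Invented example; the final "*" row holds reads with no assigned reference.
SAMPLE = "chr1\t1000\t90\t0\nchr2\t500\t8\t0\n*\t0\t0\t2\n"
rows = parse_idxstats(io.StringIO(SAMPLE))
print(mapped_fraction(rows))
```

For a real file, pass an open file handle in place of the in-memory sample.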

VCFtools Processing

Output files

Heterozygosity generated by VCFtools: <fasta_name>.pacbio.<sample_name>_deepvariant.vcf.het.
Per site nucleotide diversity calculated by VCFtools: <fasta_name>.pacbio.<sample_name>_deepvariant.vcf.sites.pi.
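The .het file can be read with a few lines of code. The sketch below assumes the usual VCFtools --het column layout (INDV, O(HOM), E(HOM), N_SITES, F) and derives the share of sites that are not homozygous. The sample row is invented, and the fraction is relative to the sites present in the VCF, not to the genome length.

```python
import csv
import io


def read_het(handle):
    """Map each individual to its site count, heterozygous fraction and F."""
    summary = {}
    for row in csv.DictReader(handle, delimiter="\t"):
        n_sites = int(row["N_SITES"])
        observed_hom = int(row["O(HOM)"])
        summary[row["INDV"]] = {
            "n_sites": n_sites,
            # Sites that are not observed as homozygous, over all sites.
            "het_fraction": (n_sites - observed_hom) / n_sites if n_sites else 0.0,
            "F": float(row["F"]),
        }
    return summary


# Invented example row; a real file comes from the pipeline output.
SAMPLE = (
    "INDV\tO(HOM)\tE(HOM)\tN_SITES\tF\n"
    "sample1\t750\t700.0\t1000\t0.16667\n"
)
het = read_het(io.StringIO(SAMPLE))
print(het["sample1"]["het_fraction"])
```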

PacBio Variant Calling

The aligned PacBio read data is used to call variants with DeepVariant. To speed up the calling, the genome FASTA file is split and each part is processed separately. BCFtools then combines the split VCF and GVCF files generated by DeepVariant.

Output files

variant_calling
- Compressed VCF files: <fasta_name>.pacbio.<sample_name>_deepvariant.vcf.gz.
- Compressed GVCF files: <fasta_name>.pacbio.<sample_name>_deepvariant.g.vcf.gz.
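A compressed VCF can be inspected without extra tools, since Python's gzip module can read the compression used for .vcf.gz files. The sketch below tallies records by the FILTER column (the seventh VCF column). The function names and sample records are illustrative, not taken from a real run.

```python
import gzip
from collections import Counter


def count_by_filter(lines):
    """Count VCF records per FILTER value, skipping header lines."""
    counts = Counter()
    for line in lines:
        if not line.strip() or line.startswith("#"):
            continue
        fields = line.rstrip("\n").split("\t")
        counts[fields[6]] += 1  # FILTER is the 7th column
    return counts


def count_vcf_gz(path_or_file):
    """Same tally, reading from a gzip-compressed VCF path or file object."""
    with gzip.open(path_or_file, "rt") as handle:
        return count_by_filter(handle)


# Invented records, for illustration only.
SAMPLE_LINES = [
    "##fileformat=VCFv4.2\n",
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\n",
    "chr1\t100\t.\tA\tG\t50\tPASS\t.\n",
    "chr1\t200\t.\tC\tT\t3\tRefCall\t.\n",
    "chr2\t300\t.\tG\tA\t45\tPASS\t.\n",
]
counts = count_by_filter(SAMPLE_LINES)
print(dict(counts))
```

When applied to a .g.vcf.gz file the tally also includes reference blocks, so the totals are not directly comparable with those from the .vcf.gz file.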

Pipeline information

Output files

pipeline_info/variantcalling/
- Reports generated by Nextflow: execution_report.html, execution_timeline.html, execution_trace.txt and pipeline_dag.dot/pipeline_dag.svg.
- Reports generated by the pipeline: pipeline_report.html, pipeline_report.txt and software_versions.yml. The pipeline_report* files will only be present if the --email / --email_on_fail parameters are used when running the pipeline.
- Reformatted samplesheet files used as input to the pipeline: samplesheet.valid.csv.

Nextflow can generate several reports on the execution of the pipeline. These help you troubleshoot errors and also provide other information such as launch commands, run times and resource usage.
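As a starting point for troubleshooting, execution_trace.txt can be summarised by task status. This sketch assumes the trace is tab-separated with a header row containing name and status columns, which is the Nextflow default; the process names and values in the sample are hypothetical.

```python
import csv
import io
from collections import Counter


def read_trace(handle):
    """Read a Nextflow trace file into a list of row dicts."""
    return list(csv.DictReader(handle, delimiter="\t"))


def status_counts(rows):
    """Count tasks per status value (for example COMPLETED or FAILED)."""
    return Counter(row["status"] for row in rows)


def tasks_with_status(rows, status):
    """Return the names of tasks that ended with the given status."""
    return [row["name"] for row in rows if row["status"] == status]


# Hypothetical trace content; real process names depend on the pipeline.
SAMPLE = (
    "task_id\thash\tname\tstatus\texit\n"
    "1\tab/123456\tALIGN (s1)\tCOMPLETED\t0\n"
    "2\tcd/789abc\tSTATS (s1)\tCOMPLETED\t0\n"
    "3\tef/def012\tCALL (s1)\tFAILED\t1\n"
)
trace_rows = read_trace(io.StringIO(SAMPLE))
print(status_counts(trace_rows))
print(tasks_with_status(trace_rows, "FAILED"))
```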