Introduction

This document describes the output produced by the pipeline.

The directories listed below will be created in the results directory after the pipeline has finished. All paths are relative to the top-level results directory.

Pipeline overview

The pipeline is built using Nextflow and processes data using the following steps:

Readmapping Alignments

The unaligned PacBio read data is filtered and aligned using minimap2. CRAM files from the same sample are merged.

Output files
  • readmapping
    • Aligned CRAM files: <fasta_name>.pacbio.<sample_name>.cram.
    • Aligned CRAM index files: <fasta_name>.pacbio.<sample_name>.cram.crai.
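The naming pattern above can be turned into a quick check that the expected files exist. This is a minimal sketch, not part of the pipeline: the function name `expected_readmapping_files` and the values `results`, `genome` and `sampleA` are hypothetical placeholders for your own results directory, FASTA name and sample name.

```python
from pathlib import Path


def expected_readmapping_files(results_dir, fasta_name, sample_name):
    """Return the CRAM and CRAI paths expected for one sample."""
    base = Path(results_dir) / "readmapping"
    cram = base / f"{fasta_name}.pacbio.{sample_name}.cram"
    # The index file name is the CRAM file name with ".crai" appended.
    crai = cram.with_name(cram.name + ".crai")
    return cram, crai


if __name__ == "__main__":
    # "results", "genome" and "sampleA" are placeholder values.
    for path in expected_readmapping_files("results", "genome", "sampleA"):
        print(path, "found" if path.exists() else "missing")
```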

Alignment Statistics

Statistics for the aligned CRAM files are calculated using samtools.

Output files
  • statistics
    • Comprehensive statistics from alignment file: <fasta_name>.pacbio.<sample_name>.stats.
    • Number of alignments for each FLAG type: <fasta_name>.pacbio.<sample_name>.flagstats.
    • Alignment summary statistics per reference sequence (sequence length, mapped and unmapped read counts): <fasta_name>.pacbio.<sample_name>.idxstats.
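The .idxstats file is plain tab-separated text, so it is easy to summarise. The sketch below assumes the standard four-column layout written by samtools idxstats (reference name, sequence length, mapped count, unmapped count, with no header line); the function names `parse_idxstats` and `total_mapped` are illustrative and not part of the pipeline.

```python
def parse_idxstats(text):
    """Parse samtools idxstats output into a list of dictionaries."""
    rows = []
    for line in text.splitlines():
        if not line.strip():
            continue  # skip blank lines
        name, length, mapped, unmapped = line.split("\t")
        rows.append(
            {
                "ref": name,
                "length": int(length),
                "mapped": int(mapped),
                "unmapped": int(unmapped),
            }
        )
    return rows


def total_mapped(rows):
    """Sum the mapped counts over all reference sequences."""
    return sum(row["mapped"] for row in rows)
```

To use it on a real file, read the file contents and pass the text to `parse_idxstats`. The final row, named `*`, holds reads that are not assigned to any reference sequence.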

VCFtools Processing

Output files
  • Heterozygosity generated by VCFtools: <fasta_name>.pacbio.<sample_name>_deepvariant.vcf.het.
  • Per-site nucleotide diversity calculated by VCFtools: <fasta_name>.pacbio.<sample_name>_deepvariant.vcf.sites.pi.
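As an illustration of how the .het file might be used, the sketch below derives the observed heterozygosity per individual. It assumes the usual tab-separated VCFtools layout with the header columns INDV, O(HOM), E(HOM), N_SITES and F; the function name `observed_heterozygosity` is hypothetical.

```python
import csv
import io


def observed_heterozygosity(het_text):
    """Map each individual to the fraction of sites that are heterozygous."""
    reader = csv.DictReader(io.StringIO(het_text), delimiter="\t")
    result = {}
    for row in reader:
        n_sites = int(row["N_SITES"])
        observed_hom = int(row["O(HOM)"])
        # Heterozygous sites are the sites that are not homozygous.
        if n_sites:
            result[row["INDV"]] = (n_sites - observed_hom) / n_sites
        else:
            result[row["INDV"]] = float("nan")
    return result
```

The F column already reports the inbreeding coefficient; this calculation only re-expresses the observed homozygote count as a proportion of heterozygous sites.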

PacBio Variant Calling

The aligned PacBio read data is used to call variants with DeepVariant. The genome FASTA file is split so that variant calling runs faster, and BCFtools combines the split VCF and GVCF files generated by DeepVariant.

Output files
  • variant_calling
    • Compressed VCF files: <fasta_name>.pacbio.<sample_name>_deepvariant.vcf.gz.
    • Compressed GVCF files: <fasta_name>.pacbio.<sample_name>_deepvariant.g.vcf.gz.
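A simple sanity check on the combined VCF is to count its variant records. The sketch below relies only on the Python standard library, which can read the compressed file as ordinary gzip text; the function name `count_vcf_records` is illustrative, and for real analyses a dedicated tool such as BCFtools is the better choice.

```python
import gzip


def count_vcf_records(path):
    """Count the non-header, non-blank lines in a gzip-compressed VCF file."""
    count = 0
    with gzip.open(path, "rt") as handle:
        for line in handle:
            # Header and metadata lines start with "#".
            if line.strip() and not line.startswith("#"):
                count += 1
    return count
```

Note that for a GVCF file this counts every record, including reference blocks, so the number is not a count of variant sites.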

Pipeline information

Output files
  • pipeline_info/variantcalling/
    • Reports generated by Nextflow: execution_report.html, execution_timeline.html, execution_trace.txt and pipeline_dag.dot/pipeline_dag.svg.
    • Reports generated by the pipeline: pipeline_report.html, pipeline_report.txt and software_versions.yml. The pipeline_report* files will only be present if the --email / --email_on_fail parameters are used when running the pipeline.
    • Reformatted samplesheet files used as input to the pipeline: samplesheet.valid.csv.

Nextflow can generate various reports on the running and execution of the pipeline. These reports help you troubleshoot errors and provide other information such as launch commands, run times and resource usage.
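When troubleshooting, the trace file can be scanned for tasks that did not finish cleanly. The sketch below assumes execution_trace.txt is tab-separated with a header row containing the `name` and `status` columns, as in the default Nextflow trace layout; if the trace fields have been customised, the column names may differ. The function name `failed_tasks` is hypothetical.

```python
import csv
import io


def failed_tasks(trace_text):
    """Return the names of tasks whose status is FAILED or ABORTED."""
    reader = csv.DictReader(io.StringIO(trace_text), delimiter="\t")
    return [
        row["name"]
        for row in reader
        if row.get("status") in {"FAILED", "ABORTED"}
    ]
```

Reading the file contents and passing the text to `failed_tasks` gives a short list of task names to look up in the execution report.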