Introduction
This document describes the output produced by the pipeline.
The directories listed below will be created in the results directory after the pipeline has finished. All paths are relative to the top-level results directory.
Pipeline overview
The pipeline is built using Nextflow and processes data using the following steps:
- Readmapping Alignments - Optional aligned CRAM files generated by minimap2
- Alignments Statistics - Optional statistics files generated by samtools
- VCFtools Processing - Heterozygosity and per site nucleotide diversity calculated by VCFtools
- PacBio Variant Calling - VCF and GVCF compressed files generated by DeepVariant
- Pipeline information - Report metrics generated during the workflow execution
Readmapping Alignments
The unaligned PacBio read data is filtered and aligned using minimap2. CRAM files from the same sample are then merged.
Output files
readmapping
- Aligned CRAM files: <fasta_name>.pacbio.<sample_name>.cram
- Aligned CRAM index files: <fasta_name>.pacbio.<sample_name>.cram.crai
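The naming pattern above can be turned into a small check that the expected files exist after a run. This is a minimal sketch, not part of the pipeline: the function names are hypothetical, and it assumes the files sit directly inside the readmapping directory under the results directory.

```python
from pathlib import Path


def readmapping_outputs(results_dir, fasta_name, sample_name):
    """Return the expected CRAM and CRAM index paths for one sample.

    Assumption: files are written directly into <results_dir>/readmapping.
    """
    # Pattern from the documentation: <fasta_name>.pacbio.<sample_name>.cram
    cram = Path(results_dir) / "readmapping" / f"{fasta_name}.pacbio.{sample_name}.cram"
    # The index file name is the CRAM file name with ".crai" appended.
    crai = cram.with_name(cram.name + ".crai")
    return cram, crai


def missing_outputs(results_dir, fasta_name, sample_name):
    """List the expected readmapping files that are not present on disk."""
    expected = readmapping_outputs(results_dir, fasta_name, sample_name)
    return [path for path in expected if not path.is_file()]
```

An empty list from missing_outputs suggests that both the alignment and its index were written for that sample.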
Alignments Statistics
The statistics for the aligned CRAM files are calculated using samtools.
Output files
statistics
- Comprehensive statistics from alignment file: <fasta_name>.pacbio.<sample_name>.stats
- Number of alignments for each FLAG type: <fasta_name>.pacbio.<sample_name>.flagstats
- Alignment summary statistics: <fasta_name>.pacbio.<sample_name>.idxstats
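As an illustration of how the .idxstats file might be used downstream, the sketch below parses its contents and derives a rough mapped fraction. It assumes the standard samtools idxstats layout (tab-separated reference name, sequence length, mapped read-segments, unmapped read-segments); the function names are hypothetical.

```python
def parse_idxstats(text):
    """Parse samtools idxstats text into (name, length, mapped, unmapped) tuples."""
    rows = []
    for line in text.splitlines():
        if not line.strip():
            continue  # skip blank lines
        name, length, mapped, unmapped = line.split("\t")
        rows.append((name, int(length), int(mapped), int(unmapped)))
    return rows


def mapped_fraction(rows):
    """Rough fraction of read-segments that are mapped, across all references.

    The "*" row holds unplaced unmapped reads, so it is included in the total.
    """
    mapped = sum(row[2] for row in rows)
    unmapped = sum(row[3] for row in rows)
    total = mapped + unmapped
    return mapped / total if total else 0.0
```

Because idxstats counts read-segments, secondary and supplementary alignments can inflate the mapped count, so the result is only an approximation.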
VCFtools Processing
Output files
- Heterozygosity generated by VCFtools: <fasta_name>.pacbio.<sample_name>_deepvariant.vcf.het
- Per site nucleotide diversity calculated by VCFtools: <fasta_name>.pacbio.<sample_name>_deepvariant.vcf.sites.pi
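The two VCFtools outputs are plain tab-separated tables, so they can be summarised with the standard library. The sketch below assumes the usual VCFtools column headers (INDV, O(HOM), E(HOM), N_SITES, F for the .het file; CHROM, POS, PI for the .sites.pi file); the function names are hypothetical.

```python
import csv
import io
import math


def observed_heterozygosity(het_text):
    """Return {individual: fraction of sites that are heterozygous} from .het text."""
    result = {}
    for row in csv.DictReader(io.StringIO(het_text), delimiter="\t"):
        n_sites = int(row["N_SITES"])
        observed_hom = int(row["O(HOM)"])
        # Heterozygous sites are those not counted as observed homozygous.
        result[row["INDV"]] = (n_sites - observed_hom) / n_sites if n_sites else math.nan
    return result


def mean_site_pi(pi_text):
    """Mean of the PI column over the sites listed in .sites.pi text, skipping NaN."""
    values = []
    for row in csv.DictReader(io.StringIO(pi_text), delimiter="\t"):
        value = float(row["PI"])
        if not math.isnan(value):
            values.append(value)
    return sum(values) / len(values) if values else math.nan
```

The mean is taken only over sites present in the file, so it is not a genome-wide diversity estimate.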
PacBio Variant Calling
The aligned PacBio read data is used to call variants with DeepVariant. To speed up processing, the genome FASTA file is split and variants are called on each part. BCFTOOLS is then used to combine the split VCF and GVCF files generated by DEEPVARIANT.
Output files
variant_calling
- Compressed VCF files: <fasta_name>.pacbio.<sample_name>_deepvariant.vcf.gz
- Compressed GVCF files: <fasta_name>.pacbio.<sample_name>_deepvariant.g.vcf.gz
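A quick sanity check on the compressed VCF output is to count its records. The sketch below uses only the standard library and assumes the file is gzip-compatible (bgzip output can be read this way); the function name is hypothetical.

```python
import gzip


def count_vcf_records(path):
    """Count non-header lines in a gzip-compressed VCF or GVCF file."""
    count = 0
    with gzip.open(path, "rt") as handle:
        for line in handle:
            # Header and metadata lines start with "#"; everything else is a record.
            if not line.startswith("#"):
                count += 1
    return count
```

For a GVCF file the count also includes reference blocks, so it will usually be much larger than the number of variant calls.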
Pipeline information
Output files
pipeline_info/variantcalling/
- Reports generated by Nextflow: execution_report.html, execution_timeline.html, execution_trace.txt and pipeline_dag.dot/pipeline_dag.svg.
- Reports generated by the pipeline: pipeline_report.html, pipeline_report.txt and software_versions.yml. The pipeline_report* files will only be present if the --email / --email_on_fail parameters are used when running the pipeline.
- Reformatted samplesheet files used as input to the pipeline: samplesheet.valid.csv.
Nextflow provides excellent functionality for generating various reports relevant to the running and execution of the pipeline. These reports allow you to troubleshoot errors in a pipeline run and also provide other information such as launch commands, run times and resource usage.
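For troubleshooting, the execution_trace.txt file can be summarised programmatically. The sketch below is illustrative only: it assumes the trace is tab-separated with a header row containing "name" and "status" columns (part of Nextflow's default trace fields, but configurable), and the function names are hypothetical.

```python
import csv
import io
from collections import Counter


def task_status_counts(trace_text):
    """Count tasks per status (for example COMPLETED or FAILED) in trace text."""
    reader = csv.DictReader(io.StringIO(trace_text), delimiter="\t")
    # Assumption: the trace has a "status" column.
    return Counter(row["status"] for row in reader)


def failed_task_names(trace_text):
    """Return the names of tasks whose status is FAILED."""
    reader = csv.DictReader(io.StringIO(trace_text), delimiter="\t")
    # Assumption: the trace has "name" and "status" columns.
    return [row["name"] for row in reader if row["status"] == "FAILED"]
```

The text can be read from pipeline_info/variantcalling/execution_trace.txt and passed to either function to see which processes need attention.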