Introduction
This document describes the output produced by the pipeline.
The directories listed below will be created in the results directory after the pipeline has finished. All paths are relative to the top-level results directory.
The directories comply with Tree of Life's canonical directory structure.
Pipeline overview
The pipeline is built using Nextflow and processes data using the following steps:
- Preprocessing
  - Quality control - Check quality of input reads
  - Filtering - Filtering PacBio data before alignment
- Alignment and Mark duplicates
  - Output options - Output options for all read types
  - Short reads - Aligning HiC and Illumina reads using BWAMEM2
  - Oxford Nanopore reads - Aligning ONT reads using MINIMAP2
  - PacBio reads - Aligning PacBio CLR and CCS filtered reads using MINIMAP2
- Alignment post-processing
  - Statistics - Alignment statistics
- Workflow reporting and genomes
  - Reference genome files - Reference genome indices/files
  - Pipeline information - Report metrics generated during the workflow execution
  - MultiQC report - Combined input/output QC summary (*.html)
Preprocessing
Quality Control
Input files undergo quality assessment using FASTQC, a widely used tool for evaluating raw sequencing data. Input in CRAM format is first converted to FASTQ so that FASTQC can read it.
Output files
- Quality_control
  - *_fastqc.html: An interactive HTML report summarizing key read quality metrics
  - *_fastqc.zip: A compressed archive containing the full set of FASTQC output files, including raw data and plots, suitable for further automated parsing or archival.
Filtering
PacBio reads generated using either CLR or CCS technology are filtered using BLAST_BLASTN against a database of adapter sequences. The collated FASTQ of the filtered reads is required by the downstream alignment step. The results of the PacBio filtering subworkflow are not currently written to the results directory.
Alignment and Mark duplicates
Output options
- outfmt: Specifies the output format for alignments. It can be set to "bam", "cram", or both, separated by a comma (e.g., --outfmt bam,cram). The pipeline will generate output files in the specified formats.
- compression: Specifies the compression method for alignments. It can be set to "none" or "crumble". When set to "crumble", the pipeline compresses the quality scores of the alignments.
- merge_output: Merge output at the individual level. If merge_output is enabled (default: false), both unmerged and merged output files per sample will be generated; otherwise, only unmerged files are exported.
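The three options above can be sketched as a small validation helper. This is only an illustration, not part of the pipeline: the function name build_output_args is hypothetical, and it assumes that compression and merge_output are passed on the command line in the same --name style as --outfmt, with merge_output switched on by a bare flag.

```python
# Hypothetical helper (not part of the pipeline): validates the documented
# values and assembles the matching command-line arguments.
VALID_FORMATS = {"bam", "cram"}
VALID_COMPRESSION = {"none", "crumble"}


def build_output_args(outfmt, compression, merge_output=False):
    """Return a list of command-line arguments for the output options."""
    formats = [f.strip().lower() for f in outfmt]
    unknown = set(formats) - VALID_FORMATS
    if not formats or unknown:
        raise ValueError(f"outfmt must be 'bam', 'cram' or both, got {outfmt!r}")
    if compression not in VALID_COMPRESSION:
        raise ValueError(
            f"compression must be 'none' or 'crumble', got {compression!r}"
        )

    args = ["--outfmt", ",".join(formats), "--compression", compression]
    if merge_output:
        # Assumption: a boolean parameter is enabled by passing the bare flag.
        args.append("--merge_output")
    return args


print(" ".join(build_output_args(["bam", "cram"], "crumble", merge_output=True)))
```

Validating the values before launching a run catches a misspelled format or compression method early, rather than after the alignment steps have started.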
Short reads
Short read data from HiC and Illumina technologies is aligned with BWAMEM2_MEM. The sorted and merged alignment files are processed using the SAMTOOLS mark-duplicate workflow. The marked duplicate alignments are output in the CRAM or BAM format, along with the index.
Output files
- read_mapping
  - hic
    - merged
      - <gca_accession>.unmasked.hic.<sample_id>.[cr|b]am: Sorted and merged BAM or CRAM file at the individual level
      - <gca_accession>.unmasked.hic.<sample_id>.[cr|b]am.[cr|c]si: Index for the alignment (as either .csi or .crai)
    - <gca_accession>.unmasked.hic.<sample_id><t1>.[cr|b]am: Unmerged sorted BAM or CRAM
    - <gca_accession>.unmasked.hic.<sample_id><t1>.[cr|b]am.[cr|c]si: Index for the alignment (as either .csi or .crai)
  - illumina
    - merged
      - <gca_accession>.unmasked.illumina.<sample_id>.[cr|b]am: Sorted and merged BAM or CRAM file at the individual level
      - <gca_accession>.unmasked.illumina.<sample_id>.[cr|b]am.[cr|c]si: Index for the alignment (as either .csi or .crai)
    - <gca_accession>.unmasked.illumina.<sample_id>_T<number>.[cr|b]am: Unmerged BAM or CRAM. The _T<number> suffix is the occurrence number of the sample identifier, i.e. T1 indicates the first sample of that name and T2 the second.
    - <gca_accession>.unmasked.illumina.<sample_id>_T<number>.[cr|b]am.[cr|c]si: Corresponding index for the alignment (as either .csi or .crai)
Oxford Nanopore reads
Reads generated using Oxford Nanopore technology are aligned with MINIMAP2_ALIGN. The sorted and merged alignment is output in the CRAM or BAM format, along with the index.
Output files
- read_mapping
  - ont
    - merged
      - <gca_accession>.unmasked.ont.<sample_id>.[cr|b]am: Sorted and merged BAM or CRAM file at the individual level
      - <gca_accession>.unmasked.ont.<sample_id>.[cr|b]am.[cr|c]si: Index for the alignment (as either .csi or .crai)
    - <gca_accession>.unmasked.ont.<sample_id>_T<number>.[cr|b]am: Unmerged BAM or CRAM. The _T<number> suffix is the occurrence number of the sample identifier, i.e. T1 indicates the first sample of that name and T2 the second.
    - <gca_accession>.unmasked.ont.<sample_id>_T<number>.[cr|b]am.[cr|c]si: Corresponding index for the alignment (as either .csi or .crai)
PacBio reads
The filtered PacBio reads are aligned with MINIMAP2_ALIGN. The sorted and merged alignment is output in the CRAM or BAM format, along with the index.
Output files
- read_mapping
  - pacbio
    - merged
      - <gca_accession>.unmasked.pacbio.<sample_id>.[cr|b]am: Sorted and merged BAM or CRAM file at the individual level
      - <gca_accession>.unmasked.pacbio.<sample_id>.[cr|b]am.[cr|c]si: Index for the alignment (as either .csi or .crai)
    - <gca_accession>.unmasked.pacbio.<sample_id>_T<number>.[cr|b]am: Unmerged BAM or CRAM. The _T<number> suffix is the occurrence number of the sample identifier, i.e. T1 indicates the first sample of that name and T2 the second.
    - <gca_accession>.unmasked.pacbio.<sample_id>_T<number>.[cr|b]am.[cr|c]si: Corresponding index for the alignment (as either .csi or .crai)
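The naming patterns listed for the alignment outputs can be expanded programmatically, for example when checking that a run produced the expected files. The sketch below is illustrative only: the function name, the accession GCA_000000000.1 and the sample name sampleA are hypothetical, and it assumes that BAM files are indexed as .csi and CRAM files as .crai.

```python
def alignment_outputs(gca_accession, seq_type, sample_id, fmt, occurrence=None):
    """Return (alignment, index) file names following the documented patterns.

    occurrence=None gives the merged file name; an integer gives the
    unmerged "_T<number>" file name.
    """
    if fmt not in ("bam", "cram"):
        raise ValueError("fmt must be 'bam' or 'cram'")
    suffix = "" if occurrence is None else f"_T{occurrence}"
    alignment = f"{gca_accession}.unmasked.{seq_type}.{sample_id}{suffix}.{fmt}"
    # Assumption: BAM output is indexed as .csi and CRAM output as .crai.
    index_ext = "csi" if fmt == "bam" else "crai"
    return alignment, f"{alignment}.{index_ext}"


# Hypothetical accession and sample name, for illustration only.
print(alignment_outputs("GCA_000000000.1", "ont", "sampleA", "cram", occurrence=1))
print(alignment_outputs("GCA_000000000.1", "pacbio", "sampleA", "bam"))
```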
Alignment post-processing
External metadata
If a SAM header template is supplied with the --header option, the additional metadata it contains is added to all output alignments (*.cram or *.bam), replacing the existing @HD and @SQ entries. This behaviour can be altered by modifying the ext.args for SAMTOOLS_REHEADER in modules.config.
Read coverage
Read coverage of the output alignment file is calculated with blobtk depth.
Output files
- read_mapping
  - <sequence-type>
    - <gca_accession>.unmasked.<sequence-type>.<sample_id>.[cr|b]am.coverage.bedGraph.gz: Read coverage in bedGraph format
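As a rough illustration of how the coverage file might be consumed downstream, the sketch below computes a length-weighted mean depth per sequence. It assumes the file follows the standard four-column bedGraph layout (sequence name, start, end, value); the function name mean_depth is hypothetical.

```python
import gzip


def mean_depth(bedgraph_gz):
    """Length-weighted mean coverage per sequence from a gzipped bedGraph."""
    weighted = {}  # sequence name -> sum of (interval length * depth)
    length = {}    # sequence name -> total interval length seen
    with gzip.open(bedgraph_gz, "rt") as handle:
        for line in handle:
            # Skip blank lines and optional bedGraph header lines.
            if not line.strip() or line.startswith(("track", "browser", "#")):
                continue
            chrom, start, end, value = line.split()[:4]
            span = int(end) - int(start)
            weighted[chrom] = weighted.get(chrom, 0.0) + span * float(value)
            length[chrom] = length.get(chrom, 0) + span
    return {c: weighted[c] / length[c] for c in weighted if length[c]}
```

Weighting by interval length matters because bedGraph intervals need not be the same size, so a plain average of the value column would over-represent short intervals.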
Statistics
The output alignments, along with the index, are used to calculate mapping statistics. Output files are generated using SAMTOOLS_STATS, SAMTOOLS_FLAGSTAT and SAMTOOLS_IDXSTATS.
Output files
- read_mapping
  - hic
    - merged
      - <gca_accession>.unmasked.hic.<sample_id>.stats.gz: Comprehensive statistics from merged alignment file
      - <gca_accession>.unmasked.hic.<sample_id>.flagstat: Number of merged alignments for each FLAG type
      - <gca_accession>.unmasked.hic.<sample_id>.idxstats: Merged alignment summary statistics
    - <gca_accession>.unmasked.hic.<sample_id>_T<number>.stats.gz: Comprehensive statistics from each alignment file
    - <gca_accession>.unmasked.hic.<sample_id>_T<number>.flagstat: Number of alignments for each FLAG type
    - <gca_accession>.unmasked.hic.<sample_id>_T<number>.idxstats: Alignment summary statistics
  - ont
    - merged
      - <gca_accession>.unmasked.ont.<sample_id>.stats.gz: Comprehensive statistics from merged alignment file
      - <gca_accession>.unmasked.ont.<sample_id>.flagstat: Number of merged alignments for each FLAG type
      - <gca_accession>.unmasked.ont.<sample_id>.idxstats: Merged alignment summary statistics
    - <gca_accession>.unmasked.ont.<sample_id>_T<number>.stats.gz: Comprehensive statistics from each alignment file
    - <gca_accession>.unmasked.ont.<sample_id>_T<number>.flagstat: Number of alignments for each FLAG type
    - <gca_accession>.unmasked.ont.<sample_id>_T<number>.idxstats: Alignment summary statistics
  - pacbio
    - merged
      - <gca_accession>.unmasked.pacbio.<sample_id>.stats.gz: Comprehensive statistics from merged alignment file
      - <gca_accession>.unmasked.pacbio.<sample_id>.flagstat: Number of merged alignments for each FLAG type
      - <gca_accession>.unmasked.pacbio.<sample_id>.idxstats: Merged alignment summary statistics
    - <gca_accession>.unmasked.pacbio.<sample_id>_T<number>.stats.gz: Comprehensive statistics from each alignment file
    - <gca_accession>.unmasked.pacbio.<sample_id>_T<number>.flagstat: Number of alignments for each FLAG type
    - <gca_accession>.unmasked.pacbio.<sample_id>_T<number>.idxstats: Alignment summary statistics
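The .idxstats files are the simplest of the three to parse. The following is a minimal sketch, assuming the usual samtools idxstats layout of four tab-separated columns (sequence name, sequence length, mapped read-segments, unmapped read-segments), with unplaced reads reported on a final line named "*". The function names are hypothetical.

```python
def parse_idxstats(path):
    """Read an idxstats file into {sequence: {length, mapped, unmapped}}."""
    rows = {}
    with open(path) as handle:
        for line in handle:
            if not line.strip():
                continue
            name, seq_len, mapped, unmapped = line.rstrip("\n").split("\t")
            rows[name] = {
                "length": int(seq_len),
                "mapped": int(mapped),
                "unmapped": int(unmapped),
            }
    return rows


def mapped_fraction(rows):
    """Fraction of all counted read-segments that are mapped."""
    mapped = sum(r["mapped"] for r in rows.values())
    total = mapped + sum(r["unmapped"] for r in rows.values())
    return mapped / total if total else 0.0
```

A call such as mapped_fraction(parse_idxstats(path)) then gives a quick per-file mapping rate that can be compared between the merged and unmerged outputs.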
Workflow reporting and genomes
Reference genome files
The pipeline generates a number of genome-specific files that are required for downstream processing of the results. These include an unmasked version of the genome, produced by the UNMASK process, and an index built by BWAMEM2_INDEX. They are not currently written to the results directory.
Pipeline information
Output files
- pipeline_info/readmapping/
  - Reports generated by Nextflow: execution_report.html, execution_timeline.html, execution_trace.txt and pipeline_dag.dot/pipeline_dag.svg.
  - Reports generated by the pipeline: pipeline_report.html, pipeline_report.txt and software_versions.yml. The pipeline_report* files will only be present if the --email / --email_on_fail parameters are used when running the pipeline.
  - Reformatted samplesheet files used as input to the pipeline: samplesheet.valid.csv.
MultiQC report
The workflow generates a MultiQC summary report that aggregates and visualises statistics (e.g., FastQC, alignment statistics).
- Quality_control
  - *multiqc_report.html: An interactive HTML report summarizing key quality metrics
Nextflow can generate several reports about the running and execution of the pipeline. These help you troubleshoot errors and also provide other information such as launch commands, run times and resource usage.
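As one way of using these reports for troubleshooting, the sketch below tallies task outcomes from execution_trace.txt. It assumes the trace file is tab-separated with a header row that includes a status column, as in Nextflow's default trace fields; the function name task_status_counts is hypothetical.

```python
import csv
from collections import Counter


def task_status_counts(trace_path):
    """Count tasks per status value in a Nextflow trace file."""
    with open(trace_path, newline="") as handle:
        # Assumption: tab-separated columns with a header row naming "status".
        reader = csv.DictReader(handle, delimiter="\t")
        return Counter(row["status"] for row in reader)
```

A non-zero count for a status such as FAILED is a quick signal to open execution_report.html for the affected tasks.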