Introduction
This document describes the output produced by the pipeline.
The directories listed below will be created in the results directory after the pipeline has finished. All paths are relative to the top-level results directory.
The directories comply with Tree of Life's canonical directory structure.
Pipeline overview
Process overview
The pipeline is built using Nextflow and processes data using the following steps:
- Quality control - Check quality of input reads before and after filtering with FASTQC
- Preprocessing
- ULI preprocessing
- Demultiplexing and trimming ULI adapters with LIMA
- Mark duplicates with PBMARKDUP
- Filtering – Filtering PacBio data before alignment with HIFI_TRIMMER
- Alignment and Mark duplicates
- Output options – Output options for all read types
- Short reads – Aligning HiC and Illumina reads using BWAMEM2 (by default) or MINIMAP2
- Oxford Nanopore reads – Aligning ONT reads using MINIMAP2
- PacBio reads – Aligning PacBio CLR and CCS filtered reads using MINIMAP2
- Alignment post-processing
- Merge by specimen - Merge aligned reads by specimen
- External metadata – Additional metadata in alignments
- Read coverage – Read coverage calculations
- Statistics – Alignment statistics
- Workflow reporting
- Pipeline information - Report metrics generated during the workflow execution
- MultiQC report – Combined input/output QC summary
Output overview
pipeline_info - execution information of run
read_mapping
  ${datatype}/${specimen}
    ${run}/
      ${assembly}.${datatype}.${specimen}.${run}.${aligner}.cram: Aligned CRAM file (or .bam depending on --outfmt)
      ${assembly}.${datatype}.${specimen}.${run}.${aligner}.cram.crai: Index for the alignment
      ${assembly}.${datatype}.${specimen}.${run}.${aligner}.coverage.${window_size}.bedGraph.gz: Read coverage in bedGraph format
      qc/
        ${datatype}.${specimen}.${run}.fastqc.html: FASTQC report of reads
        ${datatype}.${specimen}.${run}.fastqc.zip: FASTQC archive of reads
        pacbio.${specimen}.${run}.rmdup.pbmarkdup.log: PBMARKDUP report of duplicate-marked PacBio reads (optional, if library: uli)
        pacbio.${specimen}.${run}.lima.report: LIMA report of adapter trimming and demultiplexing (optional, if library: uli)
        pacbio.${specimen}.${run}.filtered.fastqc.html: FASTQC report of filtered reads (optional, if filtered reads)
        pacbio.${specimen}.${run}.filtered.fastqc.zip: FASTQC archive of filtered reads (optional, if filtered reads)
        pacbio.${specimen}.${run}.hifitrimmer.bed.gz: HiFi trimmer trimming regions (optional, if filtered reads)
        pacbio.${specimen}.${run}.hifitrimmer.summary.json: HiFi trimmer trimming summary (optional, if filtered reads)
      stats/
        ${assembly}.${datatype}.${specimen}.${run}.${aligner}.flagstat: Number of alignments for each FLAG type
        ${assembly}.${datatype}.${specimen}.${run}.${aligner}.idxstats: Alignment summary statistics
        ${assembly}.${datatype}.${specimen}.${run}.${aligner}.stats.gz: Comprehensive statistics
    merged_${#}/ (optional, if params.merge_output is specified)
      ${assembly}.${datatype}.${specimen}.merged_${#}.${aligner}.cram: Merged aligned CRAM file
      ${assembly}.${datatype}.${specimen}.merged_${#}.${aligner}.cram.crai: Index for the merged alignment
      ${assembly}.${datatype}.${specimen}.merged_${#}.${aligner}.coverage.${window_size}.bedGraph.gz: Read coverage for merged file
      stats/
        ${assembly}.${datatype}.${specimen}.merged_${#}.${aligner}.flagstat: Number of alignments for each FLAG type
        ${assembly}.${datatype}.${specimen}.merged_${#}.${aligner}.idxstats: Merged alignment summary statistics
        ${assembly}.${datatype}.${specimen}.merged_${#}.${aligner}.stats.gz: Comprehensive statistics for merged alignment
multiqc_report.html: Interactive HTML report summarizing quality metrics from FastQC, alignment statistics, and other quality control data across all samples
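The naming scheme above can be turned into concrete paths with a small helper. The following is a minimal Python sketch, not part of the pipeline: the function name run_outputs and the example identifiers are hypothetical, only the CRAM layout is covered, and the window size is expected as an already formatted label such as 1k.

```python
from pathlib import PurePosixPath


def run_outputs(assembly, datatype, specimen, run, aligner, window_size):
    """Build the expected per-run output paths, relative to the results directory.

    `window_size` is the formatted label used in file names, for example "1k".
    """
    run_dir = PurePosixPath("read_mapping") / datatype / specimen / run
    prefix = f"{assembly}.{datatype}.{specimen}.{run}.{aligner}"
    return {
        "cram": run_dir / f"{prefix}.cram",
        "index": run_dir / f"{prefix}.cram.crai",
        "coverage": run_dir / f"{prefix}.coverage.{window_size}.bedGraph.gz",
        "flagstat": run_dir / "stats" / f"{prefix}.flagstat",
        "idxstats": run_dir / "stats" / f"{prefix}.idxstats",
        "stats": run_dir / "stats" / f"{prefix}.stats.gz",
    }


# Hypothetical identifiers, for illustration only.
paths = run_outputs("asm1", "pacbio", "spec1", "run1", "minimap2", "1k")
for label, path in paths.items():
    print(label, path)
```

Using PurePosixPath keeps the sketch free of filesystem access, so it can be used to check a results directory listing without the files being present locally.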
Preprocessing
Quality Control
Input files undergo quality assessment using FASTQC, a widely-used tool for evaluating raw sequencing data. If the input is in CRAM format, it is first converted to FASTQ format to enable compatibility with FASTQC.
Output files
read_mapping/${datatype}/${specimen}/${run}/qc/
${datatype}.${specimen}.${run}.fastqc.html: An interactive HTML report summarizing key read quality metrics
${datatype}.${specimen}.${run}.fastqc.zip: A compressed archive containing the full set of FASTQC output files
ULI preprocessing
PacBio ULI reads (library: uli) are demultiplexed with LIMA, and duplicates are marked with PBMARKDUP.
Output files
read_mapping/pacbio/${specimen}/${run}/qc/
pacbio.${specimen}.${run}.pbmarkdup.log: PBMARKDUP report of duplicate-marked PacBio reads
pacbio.${specimen}.${run}.lima.report: Statistics of demultiplexing and ULI adapter trimming
Filtering
PacBio reads generated using both CLR and CCS technology are filtered using HIFITRIMMER. Additional quality control is performed to check the filtered reads.
Output files
read_mapping/pacbio/${specimen}/${run}/qc/
pacbio.${specimen}.${run}.hifitrimmer.bed.gz: BED format file with trimming coordinates
pacbio.${specimen}.${run}.hifitrimmer.summary.json: Summary statistics of trimming results
pacbio.${specimen}.${run}.filtered.fastqc.html: FASTQC report of filtered reads
pacbio.${specimen}.${run}.filtered.fastqc.zip: FASTQC archive of filtered reads
Alignment and Mark duplicates
This section documents the output files from alignment and duplicate marking steps of the pipeline. These files are generated after the Preprocessing step completes.
Output options
- outfmt: Specifies the output format for alignments. It can be set to "bam", "cram", or both, separated by a comma (e.g., --outfmt bam,cram). The pipeline will generate output files in the specified formats.
- compression: Specifies the compression method for alignments. It can be set to "none" or "crumble". When set to "crumble", the pipeline compresses the quality scores of the alignments.
- merge_output: Merge output at the individual level. If merge_output is enabled (default: false), both unmerged and merged output files per sample will be generated; otherwise, only unmerged files are exported.
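To illustrate the values accepted by outfmt, here is a minimal Python sketch that splits and validates a comma-separated value. It is not the pipeline's own validation code, and the function name parse_outfmt is hypothetical.

```python
ALLOWED_FORMATS = ("bam", "cram")


def parse_outfmt(value):
    """Split a comma-separated outfmt value and reject unknown formats."""
    formats = [item.strip() for item in value.split(",") if item.strip()]
    unknown = [fmt for fmt in formats if fmt not in ALLOWED_FORMATS]
    if not formats or unknown:
        raise ValueError(f"outfmt must be 'bam', 'cram' or both, got {value!r}")
    # Drop repeats while keeping the order in which formats were given.
    return list(dict.fromkeys(formats))


print(parse_outfmt("bam,cram"))
print(parse_outfmt("cram"))
```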
Short reads
Short read data from HiC and Illumina technologies is aligned with BWAMEM2_MEM (by default) or MINIMAP2. The sorted alignment files are processed using the SAMTOOLS mark-duplicate workflow. The marked duplicate alignments are output in the CRAM or BAM format.
Output files
read_mapping/${datatype}/${specimen}/${run}/
${assembly}.${datatype}.${specimen}.${run}.${aligner}.cram: Aligned CRAM file (or .bam depending on --outfmt)
${assembly}.${datatype}.${specimen}.${run}.${aligner}.cram.crai: Index for the alignment
${assembly}.${datatype}.${specimen}.${run}.${aligner}.coverage.${window_size}.bedGraph.gz: Read coverage in bedGraph format
merged_${#}/: if params.merge_output is enabled, merged output files with the same structure as individual runs, without the qc folder
Oxford Nanopore reads
Reads generated using Oxford Nanopore technology are aligned with MINIMAP2_ALIGN. The sorted alignment is output in the CRAM or BAM format.
Output files
read_mapping/ont/${specimen}/${run}/
${assembly}.ont.${specimen}.${run}.${aligner}.cram: Aligned CRAM file (or .bam depending on --outfmt)
${assembly}.ont.${specimen}.${run}.${aligner}.cram.crai: Index for the alignment
${assembly}.ont.${specimen}.${run}.${aligner}.coverage.${window_size}.bedGraph.gz: Read coverage in bedGraph format
merged_${#}/: if params.merge_output is enabled
PacBio reads
The filtered PacBio reads are aligned with MINIMAP2_ALIGN. The sorted alignment is output in the CRAM or BAM format.
Output files
read_mapping/pacbio/${specimen}/${run}/
${assembly}.pacbio.${specimen}.${run}.${aligner}.cram: Aligned CRAM file (or .bam depending on --outfmt)
${assembly}.pacbio.${specimen}.${run}.${aligner}.cram.crai: Index for the alignment
${assembly}.pacbio.${specimen}.${run}.${aligner}.coverage.${window_size}.bedGraph.gz: Read coverage in bedGraph format
merged_${#}/: if params.merge_output is enabled
Alignment post-processing
External metadata
If provided using the --header option, all output alignments (*.cram or *.bam) will include any additional metadata supplied as a SAM header template, replacing the existing @HD and @SD entries (note that this behaviour can be altered by modifying the ext.args for SAMTOOLS_REHEADER in modules.config).
Read coverage
Read coverage of the output alignment file is calculated with blobtk depth and output alongside the alignment files.
File naming: ${assembly}.${datatype}.${specimen}.${run}.${aligner}.coverage.${window_size}.bedGraph.gz
The ${window_size} is formatted as <N>k when the window size for coverage calculation (params.window_size) is divisible by 1000 (for example 1k) and <N>bp otherwise (for example 1500bp).
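The formatting rule can be sketched in a few lines of Python. This is an illustration rather than the pipeline's own code: the function name format_window_size is hypothetical, and it assumes that <N> in the k form is the window size divided by 1000.

```python
def format_window_size(window_size):
    """Return '<N>k' for multiples of 1000 and '<N>bp' otherwise."""
    if window_size <= 0:
        raise ValueError("window_size must be a positive integer")
    if window_size % 1000 == 0:
        # Assumption: N is the number of whole kilobases.
        return f"{window_size // 1000}k"
    return f"{window_size}bp"


for size in (1000, 25000, 1500):
    print(size, format_window_size(size))
```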
Statistics
The output alignments are used to calculate mapping statistics. Output files are generated using SAMTOOLS_STATS, SAMTOOLS_FLAGSTAT and SAMTOOLS_IDXSTATS and are organized in stats/ subdirectories of each run or merged specimen:
File naming:
${assembly}.${datatype}.${specimen}.${run}.${aligner}.flagstat: Number of alignments for each FLAG type
${assembly}.${datatype}.${specimen}.${run}.${aligner}.idxstats: Alignment summary statistics
${assembly}.${datatype}.${specimen}.${run}.${aligner}.stats.gz: Comprehensive statistics
For merged output (when merge_output is enabled), replace ${run} with merged_${#} in the filenames.
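As an example of working with these files, the idxstats output is a tab-separated table with four columns: reference name, sequence length, mapped read-segments, and unmapped read-segments. The Python sketch below sums the last two columns. The function name summarise_idxstats and the example rows are made up for illustration.

```python
import csv
import io


def summarise_idxstats(handle):
    """Sum mapped and unmapped read-segments over all rows of an idxstats table."""
    mapped = unmapped = 0
    for row in csv.reader(handle, delimiter="\t"):
        if not row:
            continue  # skip blank lines
        _name, _length, n_mapped, n_unmapped = row
        mapped += int(n_mapped)
        unmapped += int(n_unmapped)
    return {"mapped": mapped, "unmapped": unmapped}


# Made-up example content in the four-column idxstats layout.
example = io.StringIO("chr1\t1000\t90\t2\nchr2\t500\t10\t0\n*\t0\t0\t5\n")
print(summarise_idxstats(example))
```

For a real file, pass an open text handle for the .idxstats file instead of the in-memory example.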
Workflow reporting
Pipeline information
Output files
pipeline_info/
execution_report_<timestamp>.html: Nextflow execution report
execution_timeline_<timestamp>.html: Nextflow execution timeline visualization
execution_trace_<timestamp>.txt: Nextflow execution trace with resource usage details
pipeline_dag_<timestamp>.html: Pipeline DAG (Directed Acyclic Graph) visualization
params_<timestamp>.json: Parameters used in the pipeline run
readmapping_software_mqc_versions.yml: Software versions used in the workflow
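The execution trace can be inspected programmatically when troubleshooting a run. The following Python sketch counts tasks per status. It assumes the trace is tab-separated with a header row that includes a status column, as in Nextflow's default trace layout. The function name count_task_status and the example rows are hypothetical.

```python
import csv
import io
from collections import Counter


def count_task_status(handle):
    """Count tasks per value of the 'status' column in a tab-separated trace."""
    reader = csv.DictReader(handle, delimiter="\t")
    return Counter(row["status"] for row in reader)


# Made-up trace content with a reduced set of columns.
example = io.StringIO(
    "task_id\tname\tstatus\n"
    "1\tFASTQC (a)\tCOMPLETED\n"
    "2\tFASTQC (b)\tCOMPLETED\n"
    "3\tSAMTOOLS_STATS (a)\tFAILED\n"
)
print(count_task_status(example))
```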
MultiQC report
The workflow generates a MultiQC summary report that aggregates and visualises statistics (e.g., FastQC, alignment statistics).
Output files
multiqc_report.html: Interactive HTML report summarizing quality metrics from FastQC, alignment statistics, and other quality control data across all samples
Nextflow provides excellent functionality for generating various reports relevant to the running and execution of the pipeline. This will allow you to troubleshoot errors with the running of the pipeline, and also provide you with other information such as launch commands, run times and resource usage.