Introduction
This document describes the output produced by the genomeassembly pipeline.
The standard assembly pipeline runs hifiasm on the HiFi reads, purges the primary contigs with purge_dups, and scaffolds them with YaHS.
Optionally, if Illumina 10X data is provided, the purged contigs and haplotigs can be polished.
For a diploid genome where the HiFi and HiC data come from the same individual, an additional hifiasm run in HiC mode produces two balanced, fully phased haplotypes. These haplotypes are not purged but are scaffolded directly with YaHS.
Optionally, organelle assembly can be triggered. The mitochondrion and (if relevant) plastid sequences are produced using MitoHiFi and OATK.
The directories listed below will be created in the --outdir directory after the pipeline has finished. All paths are relative to the top-level --outdir directory.
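The output locations in this document are given as regex-like patterns relative to --outdir. A minimal sketch of resolving such a pattern against a finished run is shown below. The function name `find_outputs`, the example directory names, and the exact regular expression are illustrative assumptions and are not part of the pipeline.

```python
import re
from pathlib import Path


def find_outputs(outdir, pattern):
    """Return outdir-relative paths of files whose relative path matches the regex."""
    root = Path(outdir)
    regex = re.compile(pattern)
    matches = []
    for path in sorted(root.rglob("*")):
        # Compare against the path relative to --outdir, using forward slashes.
        rel = path.relative_to(root).as_posix()
        if path.is_file() and regex.fullmatch(rel):
            matches.append(rel)
    return matches


if __name__ == "__main__":
    # "results" is a hypothetical --outdir; the regex is an assumed rendering
    # of the primary-contig pattern listed under RAW_ASSEMBLY.
    for rel in find_outputs("results", r".*hifiasm.*/.*p_ctg\.g?fa"):
        print(rel)
```

Matching on the relative path keeps the patterns independent of where --outdir itself is located.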
Subworkflows
The pipeline is built using Nextflow DSL2.
PREPARE_INPUT
This subworkflow processes the input YAML and generates the input channels used by the other subworkflows.
GENOMESCOPE_MODEL
Output files
kmer/*ktab
- kmer table file
kmer/*hist
- kmer histogram file
kmer/*model.txt
- genomescope model in text format
kmer/*[linear,log]_plot.png
- genomescope kmer plots
This subworkflow generates a k-mer database and a coverage model used in PURGE_DUPS and GENOME_STATISTICS.
RAW_ASSEMBLY
Output files
.*hifiasm.*/.*p_ctg.[g]fa
- primary assembly in GFA and FASTA format; for more details refer to hifiasm output
.*hifiasm.*/.*a_ctg.[g]fa
- haplotigs in GFA and FASTA format; for more details refer to hifiasm output
.*hifiasm-hic.*/.*hap1.p_ctg.[g]fa
- fully phased hap1 if hifiasm is run in HiC mode; for more details refer to hifiasm output
.*hifiasm-hic.*/.*hap2.p_ctg.[g]fa
- fully phased hap2 if hifiasm is run in HiC mode; for more details refer to hifiasm output
.*hifiasm.*/.*bin
- internal binary hifiasm files; for more details refer here
This subworkflow generates one or more raw assemblies. First, hifiasm is run on the input HiFi reads, then the raw contigs are converted from GFA into FASTA format. The resulting assembly is subject to purging, optional polishing and scaffolding further down the pipeline.
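The GFA to FASTA conversion can be illustrated with a short sketch. It assumes only the general GFA convention that segment records start with `S` and carry the name and sequence in the second and third tab-separated fields. The function name `gfa_to_fasta` is hypothetical, and this is not necessarily the tool the pipeline itself uses for the conversion.

```python
def gfa_to_fasta(gfa_lines, width=60):
    """Yield FASTA lines for every segment (S) record found in GFA text lines."""
    for line in gfa_lines:
        if not line.startswith("S\t"):
            continue  # skip headers, links and any other record types
        fields = line.rstrip("\n").split("\t")
        name, sequence = fields[1], fields[2]
        if sequence == "*":
            continue  # the sequence is not stored in this GFA record
        yield f">{name}"
        # Wrap the sequence at a fixed line width.
        for start in range(0, len(sequence), width):
            yield sequence[start:start + width]


if __name__ == "__main__":
    # Hypothetical two-record GFA used only for illustration.
    example = ["H\tVN:Z:1.0\n", "S\tcontig_1\tACGTACGTAC\tLN:i:10\n"]
    print("\n".join(gfa_to_fasta(example, width=5)))
```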
PURGE_DUPS
Output files
*.hifiasm..*/purged.fa
- purged primary contigs
*.hifiasm..*/purged.htigs.fa
- haplotigs after purging
- other files from the purge_dups pipeline
- for details refer here
The retained haplotype is identified in the primary assembly, and the alternate contigs are updated correspondingly. The subworkflow relies on the k-mer coverage model to identify coverage thresholds. For more details see purge_dups. The two haplotype assemblies produced by hifiasm in HiC mode are not purged.
POLISHING
Output files
*.hifiasm..*/polishing/.*consensus.fa
- polished joined primary and haplotigs assembly
*.hifiasm..*/polishing/merged.vcf.gz
- unfiltered variants
*.hifiasm..*/polishing/merged.vcf.gz.tbi
- index file
*.hifiasm..*/polishing/refdata-*
- Longranger assembly indices
This subworkflow uses read mapping of the Illumina 10X short read data to fix short errors in primary contigs and haplotigs.
HIC_MAPPING
Output files
*.hifiasm.*/scaffolding[_hap1/_hap2/^$]/.*_merged_sorted.bed
- bed file obtained from merged mkdup bam
*.hifiasm.*/scaffolding[_hap1/_hap2/^$]/.*mkdup.bam
- final read mapping bam with mapped reads
This subworkflow aligns the Illumina HiC short reads to the primary assembly. It uses CONVERT_STATS as an internal subworkflow to calculate read mapping statistics.
CONVERT_STATS
Output files
*.hifiasm.*/scaffolding[_hap1/_hap2/^$]/.*.stats
- output of samtools stats
*.hifiasm.*/scaffolding[_hap1/_hap2/^$]/.*.idxstats
- output of samtools idxstats
*.hifiasm.*/scaffolding[_hap1/_hap2/^$]/.*.flagstat
- output of samtools flagstat
This subworkflow produces statistics for a BAM file containing read mappings. It is executed within the HIC_MAPPING subworkflow.
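To inspect the mapping statistics programmatically, the summary section of a samtools stats file can be read with a few lines of code. The sketch below assumes the usual layout in which summary lines start with `SN`, followed by a tab-separated key ending in a colon and a numeric value. The function name `parse_samtools_stats_summary` is hypothetical.

```python
def parse_samtools_stats_summary(lines):
    """Collect the SN (summary numbers) records of samtools stats output into a dict."""
    summary = {}
    for line in lines:
        if not line.startswith("SN\t"):
            continue  # ignore comments and the other report sections
        fields = line.rstrip("\n").split("\t")
        key = fields[1].rstrip(":")
        value = fields[2]
        # Values are integers for counts and floats for rates or averages.
        try:
            summary[key] = int(value)
        except ValueError:
            summary[key] = float(value)
    return summary


if __name__ == "__main__":
    # Hypothetical excerpt of a .stats file, used only for illustration.
    example = [
        "SN\traw total sequences:\t1000\n",
        "SN\treads mapped:\t950\n",
    ]
    stats = parse_samtools_stats_summary(example)
    print(stats["reads mapped"] / stats["raw total sequences"])
```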
SCAFFOLDING
Output files
*.hifiasm.*/scaffolding[_hap1/_hap2/^$]/yahs/out.break.yahs/out_scaffolds_final.fa
- scaffolds in FASTA format
*.hifiasm.*/scaffolding[_hap1/_hap2/^$]/yahs/out.break.yahs/out_scaffolds_final.agp
- coordinates of contigs relative to scaffolds
*.hifiasm.*/scaffolding[_hap1/_hap2/^$]/yahs/out.break.yahs/alignments_sorted.txt
- Alignments for Juicer in text format
*.hifiasm.*/scaffolding[_hap1/_hap2/^$]/yahs/out.break.yahs/yahs_scaffolds.hic
- Juicer HiC map
*.hifiasm.*/scaffolding[_hap1/_hap2/^$]/yahs/out.break.yahs/*cool
- HiC map for cooler
*.hifiasm.*/scaffolding[_hap1/_hap2/^$]/yahs/out.break.yahs/*.FullMap.png
- Pretext snapshot
The subworkflow scaffolds the primary contigs using the HiC mapping generated in HIC_MAPPING. It also performs postprocessing steps such as generating cooler and Pretext files.
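The AGP file records where each contig sits within a scaffold. A minimal sketch of reading those placements is given below, assuming the standard nine-column, tab-separated AGP layout in which component types `N` and `U` denote gaps. The names `Placement` and `parse_agp` are hypothetical and the example records are invented for illustration.

```python
from collections import namedtuple

# One contig placed on a scaffold (coordinates are 1-based and inclusive in AGP).
Placement = namedtuple(
    "Placement",
    "scaffold start end contig contig_start contig_end orientation",
)


def parse_agp(lines):
    """Return the contig placements described by AGP lines, skipping gap records."""
    placements = []
    for line in lines:
        if not line.strip() or line.startswith("#"):
            continue  # blank line or comment
        f = line.rstrip("\n").split("\t")
        if f[4] in ("N", "U"):
            continue  # gap of known (N) or unknown (U) size
        placements.append(
            Placement(f[0], int(f[1]), int(f[2]), f[5], int(f[6]), int(f[7]), f[8])
        )
    return placements


if __name__ == "__main__":
    example = [
        "scaffold_1\t1\t1000\t1\tW\tcontig_a\t1\t1000\t+\n",
        "scaffold_1\t1001\t1200\t2\tN\t200\tscaffold\tyes\tproximity_ligation\n",
        "scaffold_1\t1201\t1700\t3\tW\tcontig_b\t1\t500\t-\n",
    ]
    for placement in parse_agp(example):
        print(placement)
```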
GENOME_STATISTICS
Output files
.*.assembly_summary
- numeric statistics for pri and alt sequences
.*ccs.merquryk
- folder with merqury plots and kmer statistics
.*busco
- folder with BUSCO results
This subworkflow evaluates the quality of the sequences. It is run after the intermediate steps, such as raw assembly generation, purging and polishing, and also at the end of the pipeline when scaffolds are produced.
ORGANELLES
Output files
*.hifiasm.*/mito..*/final_mitogenome.fasta
- organelle assembly
*.hifiasm.*/mito..*/final_mitogenome.[gb,gff]
- organelle gene annotation
*.hifiasm.*/mito..*/contigs_stats.tsv
- summary of mitochondrial findings
- output also includes other output files produced by MitoHiFi
*.hifiasm.*/oatk/.*mito.ctg.fasta
- mitochondrion assembly
*.hifiasm.*/oatk/.*mito.gfa
- assembly graph for the mitochondrion assembly
*.hifiasm.*/oatk/.*pltd.ctg.fasta
- plastid assembly
*.hifiasm.*/oatk/.*pltd.gfa
- assembly graph for the plastid assembly
- output also includes other output files produced by oatk
This subworkflow assembles the organelles. First it identifies a reference mitochondrion assembly by querying NCBI; then MitoHiFi is called on the raw HiFi reads and, separately, on the assembled contigs using the queried reference. OATK is called independently on the raw reads. For plants, an optional path to a plastid HMM can be provided in the YAML, in which case OATK is run for both types of organelles.
Pipeline information
Nextflow provides excellent functionality for generating various reports relevant to the running and execution of the pipeline. This will allow you to troubleshoot errors with the running of the pipeline, and also provide you with other information such as launch commands, run times and resource usage.
Output files
genomeassembly_info/
- Reports generated by Nextflow: execution_report.html, execution_timeline.html, execution_trace.txt and pipeline_dag.dot/pipeline_dag.svg.
- Reports generated by the pipeline: pipeline_report.html, pipeline_report.txt and software_versions.yml. The pipeline_report* files will only be present if the --email / --email_on_fail parameters are used when running the pipeline.
- Reformatted samplesheet files used as input to the pipeline: samplesheet.valid.csv.
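When troubleshooting, the trace file can be summarised with a short script. The sketch below assumes the trace is tab-separated with a header row that includes a `status` column, as in Nextflow's default trace layout. The function name `count_task_statuses` and the example rows are hypothetical.

```python
import csv
from collections import Counter


def count_task_statuses(trace_lines):
    """Count tasks per status from the lines of a tab-separated Nextflow trace."""
    reader = csv.DictReader(trace_lines, delimiter="\t")
    return Counter(row["status"] for row in reader)


if __name__ == "__main__":
    # Invented trace rows; a real run would read execution_trace.txt instead.
    example = [
        "task_id\tname\tstatus\texit\n",
        "1\tSTEP_A (sample)\tCOMPLETED\t0\n",
        "2\tSTEP_B (sample)\tFAILED\t1\n",
    ]
    for status, count in count_task_statuses(example).items():
        print(status, count)
```

A non-zero count for any status other than COMPLETED is a reasonable starting point for looking into the corresponding task in the HTML execution report.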