sanger-tol/genomeassembly

Implementation of ToL genome assembly workflows

This pipeline is currently in development and does not yet have any stable releases.

https://github.com/sanger-tol/genomeassembly

Introduction

This document describes the output produced by the pipeline.

The standard assembly pipeline runs hifiasm on the HiFi reads, purges the primary contigs with purge_dups, and scaffolds them with YaHS. Optionally, if Illumina 10X data is provided, the purged contigs and haplotigs can be polished.

For a diploid genome where the HiFi and HiC data come from the same individual, an additional hifiasm run in HiC mode produces two balanced, fully phased haplotypes. These haplotypes are not purged; they are scaffolded directly with YaHS.

Optionally, organelle assembly can be triggered. The mitochondrion and (if relevant) plastid sequences are produced using MitoHiFi and OATK.

The directories listed below will be created in the --outdir directory after the pipeline has finished. All paths are relative to the top-level --outdir directory.
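Because every path in this document is relative to --outdir, the listings below can be checked against a finished run with a few glob patterns. The following is a minimal sketch, not part of the pipeline: the function name `find_outputs` and the `PATTERNS` table are hypothetical, and the glob patterns are loose adaptations of the listings in this document rather than exact file names.

```python
from pathlib import Path

# Hypothetical glob patterns, loosely adapted from the listings in this document.
PATTERNS = {
    "kmer histogram": "kmer/*hist",
    "genomescope model": "kmer/*model.txt",
    "purged primary contigs": "*hifiasm*/purged.fa",
}


def find_outputs(outdir, patterns=PATTERNS):
    """Return {label: sorted paths relative to outdir} for each glob pattern."""
    root = Path(outdir)
    return {
        label: sorted(p.relative_to(root).as_posix() for p in root.glob(pattern))
        for label, pattern in patterns.items()
    }
```

An empty list for a label means no file under the given directory matched that pattern, for example because the corresponding step was not run.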

Subworkflows

The pipeline is built using Nextflow DSL2.

PREPARE_INPUT

Here the input YAML is processed. This subworkflow generates the input channels used by the other subworkflows.

GENOMESCOPE_MODEL

Output files

kmer/*ktab
- kmer table file
kmer/*hist
- kmer histogram file
kmer/*model.txt
- genomescope model in text format
kmer/*[linear,log]_plot.png
- genomescope kmer plots

This subworkflow generates a kmer database and a coverage model, which are used in PURGE_DUPS and GENOME_STATISTICS.

Subworkflow for kmer profile

RAW_ASSEMBLY

Output files

.*hifiasm.*/.*p_ctg.[g]fa
- primary assembly in GFA and FASTA format; for more details refer to hifiasm output
.*hifiasm.*/.*a_ctg.[g]fa
- haplotigs in GFA and FASTA format; for more details refer to hifiasm output
.*hifiasm-hic.*/.*hap1.p_ctg.[g]fa
- fully phased hap1 if hifiasm is run in HiC mode; for more details refer to hifiasm output
.*hifiasm-hic.*/.*hap2.p_ctg.[g]fa
- fully phased hap2 if hifiasm is run in HiC mode; for more details refer to hifiasm output
.*hifiasm.*/.*bin
- internal binary hifiasm files; for more details refer to the hifiasm documentation

This subworkflow generates the raw assembly or assemblies. First, hifiasm is run on the input HiFi reads; the raw contigs are then converted from GFA to FASTA format. This assembly is subsequently purged, polished (optional) and scaffolded further down the pipeline.

Raw assembly subworkflow
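The GFA to FASTA conversion mentioned above amounts to writing out the segment (S) lines of the graph as sequence records. The sketch below illustrates the idea only; it is not the pipeline's own implementation, which may rely on a dedicated tool, and the function name `gfa_to_fasta` is hypothetical.

```python
def gfa_to_fasta(gfa_lines):
    """Yield one FASTA record (as text) per segment (S) line of a GFA file."""
    for line in gfa_lines:
        # Only segment lines carry contig sequences; skip headers, links, etc.
        if not line.startswith("S\t"):
            continue
        fields = line.rstrip("\n").split("\t")
        name, sequence = fields[1], fields[2]
        # "*" means the sequence is not stored in the GFA itself.
        if sequence == "*":
            continue
        yield f">{name}\n{sequence}\n"
```

With an open GFA file handle `gfa` and an output handle `fasta`, `fasta.writelines(gfa_to_fasta(gfa))` would write one record per contig.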

PURGE_DUPS

Output files

*.hifiasm..*/purged.fa
- purged primary contigs
*.hifiasm..*/purged.htigs.fa
- haplotigs after purging
Other files from the purge_dups pipeline are also included; for details refer to the purge_dups documentation.

The retained haplotype is identified in the primary assembly, and the alternate contigs are updated correspondingly. The subworkflow relies on the kmer coverage model to identify coverage thresholds. For more details see purge_dups. The two haplotype assemblies produced by hifiasm in HiC mode are not purged.

Subworkflow for purging haplotigs

POLISHING

Output files

*.hifiasm..*/polishing/.*consensus.fa
- polished assembly of the joined primary contigs and haplotigs
*.hifiasm..*/polishing/merged.vcf.gz
- unfiltered variants
*.hifiasm..*/polishing/merged.vcf.gz.tbi
- index file
*.hifiasm..*/polishing/refdata-*
- Longranger assembly indices

This subworkflow uses read mapping of the Illumina 10X short read data to fix short errors in primary contigs and haplotigs.

Polishing subworkflow

HIC_MAPPING

Output files

*.hifiasm.*/scaffolding[_hap1/_hap2/^$]/.*_merged_sorted.bed
- bed file obtained from merged mkdup bam
*.hifiasm.*/scaffolding[_hap1/_hap2/^$]/.*mkdup.bam
- final read mapping bam with mapped reads

This subworkflow aligns the Illumina HiC short reads to the primary assembly. It uses CONVERT_STATS as an internal subworkflow to calculate read mapping stats.

HiC mapping subworkflow

CONVERT_STATS

Output files

*.hifiasm.*/scaffolding[_hap1/_hap2/^$]/.*.stats
- output of samtools stats
*.hifiasm.*/scaffolding[_hap1/_hap2/^$]/.*.idxstats
- output of samtools idxstats
*.hifiasm.*/scaffolding[_hap1/_hap2/^$]/.*.flagstat
- output of samtools flagstat

This subworkflow produces statistics for a bam file containing read mappings. It is executed within the HIC_MAPPING subworkflow.
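The three outputs listed above correspond to three samtools subcommands, each of which writes its report to standard output. The sketch below shows how equivalent files could be produced by hand for a single BAM file. It assumes samtools is on the PATH; the helper names `stats_commands` and `run_stats` are hypothetical, and the output naming (BAM name with the extension swapped) is an assumption rather than the pipeline's documented convention.

```python
import subprocess
from pathlib import Path


def stats_commands(bam):
    """Map each output path to the samtools command whose stdout it captures."""
    bam = Path(bam)
    return {
        # Assumed naming: sample.bam -> sample.stats / .idxstats / .flagstat
        bam.with_suffix(f".{tool}"): ["samtools", tool, str(bam)]
        for tool in ("stats", "idxstats", "flagstat")
    }


def run_stats(bam):
    """Run the three commands, redirecting each report into its own file."""
    for out_path, command in stats_commands(bam).items():
        with open(out_path, "w") as handle:
            subprocess.run(command, stdout=handle, check=True)
```

Keeping command construction separate from execution makes it possible to inspect the planned commands without having samtools installed.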

SCAFFOLDING

Output files

*.hifiasm.*/scaffolding[_hap1/_hap2/^$]/yahs/out.break.yahs/out_scaffolds_final.fa
- scaffolds in FASTA format
*.hifiasm.*/scaffolding[_hap1/_hap2/^$]/yahs/out.break.yahs/out_scaffolds_final.agp
- coordinates of contigs relative to scaffolds
*.hifiasm.*/scaffolding[_hap1/_hap2/^$]/yahs/out.break.yahs/alignments_sorted.txt
- Alignments for Juicer in text format
*.hifiasm.*/scaffolding[_hap1/_hap2/^$]/yahs/out.break.yahs/yahs_scaffolds.hic
- Juicer HiC map
*.hifiasm.*/scaffolding[_hap1/_hap2/^$]/yahs/out.break.yahs/*cool
- HiC map for cooler
*.hifiasm.*/scaffolding[_hap1/_hap2/^$]/yahs/out.break.yahs/*.FullMap.png
- Pretext snapshot

The subworkflow performs scaffolding of the primary contigs using the HiC mapping generated in HIC_MAPPING. It also performs some postprocessing steps, such as generating cooler and pretext files.

Scaffolding subworkflow
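The out_scaffolds_final.agp file listed above records where each contig sits within a scaffold. As a minimal sketch of how it can be read, the parser below follows the standard tab-separated AGP column layout and skips gap lines (component types N and U). The names `Placement` and `parse_agp` are hypothetical and not part of the pipeline or of YaHS.

```python
from typing import NamedTuple


class Placement(NamedTuple):
    scaffold: str       # object (scaffold) name
    start: int          # 1-based start on the scaffold
    end: int            # 1-based inclusive end on the scaffold
    contig: str         # component (contig) name
    contig_start: int   # start on the contig
    contig_end: int     # end on the contig
    orientation: str    # "+" or "-" (other values are allowed by the format)


def parse_agp(lines):
    """Yield a Placement for every non-gap component line of an AGP file."""
    for line in lines:
        if line.startswith("#") or not line.strip():
            continue  # comment or blank line
        f = line.rstrip("\n").split("\t")
        if f[4] in ("N", "U"):
            continue  # gap line: columns 6-9 describe the gap, not a contig
        yield Placement(f[0], int(f[1]), int(f[2]), f[5], int(f[6]), int(f[7]), f[8])
```

Grouping the resulting placements by `scaffold` gives the ordered contig layout of each scaffold.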

GENOME_STATISTICS

Output files

.*.assembly_summary
- numeric statistics for primary and alternate sequences
.*ccs.merquryk
- folder with merqury plots and kmer statistics
.*busco
- folder with BUSCO results

This subworkflow is used to evaluate the quality of sequences. It is performed after the intermediate steps, such as raw assembly generation, purging and polishing, and also at the end of the pipeline when scaffolds are produced.

Genome statistics subworkflow
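The exact contents of the .assembly_summary files are not listed in this document. As an illustration of the kind of numeric statistic commonly reported for an assembly, the sketch below computes N50 from a list of sequence lengths. Whether and how the pipeline reports N50 is an assumption here, and the function name `n50` is hypothetical.

```python
def n50(lengths):
    """Return the length L such that sequences of length >= L
    together cover at least half of the total assembly length."""
    ordered = sorted(lengths, reverse=True)
    half = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        if running >= half:
            return length
    return 0  # empty input
```

Comparing this value between the raw, purged and scaffolded sequences is one simple way to see how contiguity changes along the pipeline.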

ORGANELLES

Output files

*.hifiasm.*/mito..*/final_mitogenome.fasta
- organelle assembly
*.hifiasm.*/mito..*/final_mitogenome.[gb,gff]
- organelle gene annotation
*.hifiasm.*/mito..*/contigs_stats.tsv
- summary of mitochondrial findings
The output also includes other files produced by MitoHiFi.
*.hifiasm.*/oatk/.*mito.ctg.fasta
- mitochondrion assembly
*.hifiasm.*/oatk/.*mito.gfa
- assembly graph for the mitochondrion assembly
*.hifiasm.*/oatk/.*pltd.ctg.fasta
- plastid assembly
*.hifiasm.*/oatk/.*pltd.gfa
- assembly graph for the plastid assembly
The output also includes other files produced by OATK.

This subworkflow implements assembly of organelles. First, it identifies a reference mitochondrion assembly by querying NCBI; MitoHiFi is then called on the raw HiFi reads and, separately, on the assembled contigs using the queried reference. Separately, OATK is called on the raw reads. For plants, an optional path to a plastid HMM can be provided in the YAML, in which case OATK is run for both types of organelles.

Organelles subworkflow

Pipeline information

Nextflow provides excellent functionality for generating various reports relevant to the running and execution of the pipeline. This will allow you to troubleshoot errors with the running of the pipeline, and also provide you with other information such as launch commands, run times and resource usage.

Output files

genomeassembly_info/
- Reports generated by Nextflow: execution_report.html, execution_timeline.html, execution_trace.txt and pipeline_dag.dot/pipeline_dag.svg.
- Reports generated by the pipeline: pipeline_report.html, pipeline_report.txt and software_versions.yml. The pipeline_report* files will only be present if the --email / --email_on_fail parameters are used when running the pipeline.
- Reformatted samplesheet files used as input to the pipeline: samplesheet.valid.csv.
- Parameters used by the pipeline run: params.json.