Genome After Party

Genome analysis pipelines.

Introduction

Genome After Party is a suite of pipelines to standardise the downstream analyses performed on all genomes produced by the Tree of Life. These include:

sanger-tol/insdcdownload downloads assemblies from the NCBI.
sanger-tol/ensemblrepeatdownload downloads repeat annotations from Ensembl.
sanger-tol/ensemblgenedownload downloads gene annotations from Ensembl.
sanger-tol/readmapping aligns reads generated using Illumina, HiC, PacBio and Nanopore technologies against a genome assembly.
sanger-tol/genomenote creates HiC contact maps and collates (1) assembly information, statistics and chromosome details, (2) PacBio consensus quality and k-mer completeness, and (3) HiC mapping statistics.
sanger-tol/blobtoolkit is used to identify and analyse non-target DNA for eukaryotic genomes.
sanger-tol/sequencecomposition extracts statistics from a genome about its sequence composition.
sanger-tol/variantcalling calls variants using DeepVariant with PacBio data.

These pipelines are built with Nextflow DSL2 and the nf-core template. They are designed for portability, scalability and biodiversity. All data generated by the pipelines are available at https://gap.cog.sanger.ac.uk/.

Currently we routinely run readmapping, genomenote, and blobtoolkit on Sanger assemblies (mostly the primary haplotypes, but some alternate haplotypes too). Eventually, we'll be running everything but variantcalling on all assemblies (even non-Sanger ones), and variantcalling only on primary haplotypes (including non-Sanger ones).

You can see all planned features and requests on the project board. If you have an idea for a new feature, send us your request.

INSDC Download

sanger-tol/insdcdownload downloads assemblies from the NCBI.

Current features:

Download the genome from NCBI as FASTA.
Put the unmasked version under assembly/release/ and the masked version under analysis/.
Build samtools faidx and dict indices on the genome assemblies.
Create a BED file with the coordinates of the masked regions.
Compress and index the BED file with bgzip and tabix.
Prepare a file that can be used to populate BAM headers.
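
As an illustration of the masked-region step above, here is a minimal sketch that derives BED intervals from a masked sequence. It assumes the masked FASTA is soft-masked (masked bases in lowercase), which the text does not state; the function names `masked_intervals` and `to_bed` are hypothetical and are not taken from the pipeline.

```python
from typing import Iterable, Iterator, Tuple

Interval = Tuple[str, int, int]


def masked_intervals(name: str, sequence: str) -> Iterator[Interval]:
    """Yield (name, start, end) for each run of lowercase bases.

    Coordinates follow the BED convention: 0-based start, exclusive end.
    Assumption: lowercase means masked (soft-masking).
    """
    start = None
    for pos, base in enumerate(sequence):
        if base.islower():
            if start is None:
                start = pos  # a masked run begins here
        elif start is not None:
            yield (name, start, pos)  # the run ended at the previous base
            start = None
    if start is not None:
        # The sequence ends inside a masked run.
        yield (name, start, len(sequence))


def to_bed(intervals: Iterable[Interval]) -> str:
    """Format intervals as tab-separated BED lines."""
    return "".join(f"{name}\t{start}\t{end}\n" for name, start, end in intervals)
```

For example, `to_bed(masked_intervals("chr1", "ACGTacgtNNacG"))` gives two BED lines, one per lowercase run. The resulting file would then be compressed and indexed with bgzip and tabix as listed above.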

Planned features:

Download RefSeq annotations.
Review the usage of ENA vs NCBI, and the naming of the pipeline.

Ensembl Repeat Download

sanger-tol/ensemblrepeatdownload downloads repeat annotations from Ensembl.

Current features:

Download the masked FASTA file from Ensembl.
Extract the coordinates of the masked regions into a BED file.
Compress and index the BED file with bgzip and tabix.

Planned features:

Repeat density sub-workflow.
Retrieve repeat annotations, not just coordinates.

Ensembl Gene Download

sanger-tol/ensemblgenedownload downloads gene annotations from Ensembl.

Current features:

Download gene annotations from Ensembl in GFF3 format.
Download gene sequences from Ensembl in FASTA format.
Compress and index all sequence files with bgzip, samtools faidx, and samtools dict.
Compress and index the annotation files with bgzip and tabix.

Planned features:

Gene density sub-workflow.
Sub-tracks for each biotype.

Read Mapping

sanger-tol/readmapping aligns reads generated using Illumina, HiC, PacBio and Nanopore technologies against a genome assembly.

Current features:

Align short read data (HiC and Illumina) against the genome with bwa-mem2 mem.
Mark duplicates for short read alignment with samtools.
Filter PacBio raw read data using a vector database.
Align long read data (ONT, PacBio CCS and PacBio CLR) against the genome with minimap2.
Merge all alignment files at the individual level and convert to CRAM format.
Calculate statistics for all alignment files using samtools stats, flagstat, and idxstats.
Read chunking to speed up alignment for all technologies.
Rich metadata in aligned file headers.
Support compression with crumble for aligned files.
Support multiple output options – BAM, compressed BAM, CRAM, compressed CRAM.

Planned features:

Use hifi-trimmer to filter PacBio reads.
Add calculation for PacBio filtered data percentage.
Add support for PacBio ULI reads.
Add support for RNAseq data.

Genome Note

sanger-tol/genomenote generates all the data (tables and figures) used in genome note publications. These include (1) assembly information, statistics and chromosome details, (2) PacBio consensus quality and k-mer completeness, and (3) HiC contact maps and mapping statistics.

Current features:

Create HiC contact map and chromosomal grid using Cooler.
Retrieve assembly information, statistics and chromosome details from NCBI datasets.
Compute genome completeness with BUSCO.
Compute sequence quality and k-mer completeness with the FastK/MerquryFK suite of tools.
Compute the percentage of HiC primary mappings with samtools flagstat.
Run GFA stats.
Broad fetching of genome metadata from numerous sources.
Create summary table with the information above.
Combine results and metadata with template Word document.

Planned features:

Process principal and alternate haplotypes together.
Add optional read mapping subworkflow.

BlobToolKit

sanger-tol/blobtoolkit is used to identify and analyse non-target DNA for eukaryotic genomes.

Current features:

Calculate sequence statistics in 1kb windows for each contig.
Count BUSCOs in 1kb windows for each contig using specific and basal lineages.
Calculate coverage in 1kb windows using blobtk depth.
Aggregate 1kb values into windows of fixed proportion (10%, 1% of contig length) and fixed length (100kb, 1Mb).
Diamond blastp search of BUSCO gene models for basal lineages (archaea_odb10, bacteria_odb10 and eukaryota_odb10) against the UniProt reference proteomes.
Diamond blastx search of assembly contigs against the UniProt reference proteomes.
NCBI blastn search of assembly contigs with no Diamond blastx match against the NCBI nt database.
Optional read mapping subworkflow.
Import analysis results into a BlobDir dataset.
BlobDir validation and static image generation.
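
The window aggregation step listed above can be sketched as follows. This is only an illustration: it assumes the per-contig statistic is a plain list of 1kb values and that aggregation is an unweighted mean, neither of which is confirmed by the text (counts such as BUSCOs would more likely be summed, and a shorter final 1kb bin may need weighting). Both function names are hypothetical.

```python
from typing import List, Sequence


def aggregate_fixed_length(values: Sequence[float], bins_per_window: int) -> List[float]:
    """Average consecutive 1kb values into windows of `bins_per_window` bins.

    For example, bins_per_window=100 corresponds to 100kb windows.
    The last window may cover fewer bins than the others.
    """
    if bins_per_window < 1:
        raise ValueError("bins_per_window must be at least 1")
    result = []
    for i in range(0, len(values), bins_per_window):
        chunk = values[i:i + bins_per_window]
        result.append(sum(chunk) / len(chunk))
    return result


def aggregate_fixed_proportion(values: Sequence[float], proportion: float) -> List[float]:
    """Average 1kb values into windows spanning a fraction of the contig.

    For example, proportion=0.1 corresponds to windows of 10% of contig length.
    Contigs too short for that fraction fall back to one bin per window.
    """
    bins_per_window = max(1, round(len(values) * proportion))
    return aggregate_fixed_length(values, bins_per_window)
```

With twenty 1kb values, `aggregate_fixed_proportion(values, 0.1)` groups two bins per window and returns ten aggregated values.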

Planned features:

Runtime improvement for the blastn subworkflow.
Compute read coverage with k-mer based methods.

Sequence Composition

sanger-tol/sequencecomposition extracts statistics from a genome about its sequence composition.

Current features:

Run fasta_windows on the genome FASTA file.
Extract single-statistics bedGraph files from the multi-statistics outputs.
Compress and index all bedGraph and TSV files with bgzip and tabix.
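
As a rough sketch of the extraction step above, the following splits a multi-statistics table into one bedGraph track per statistic. It assumes a tab-separated table with a header row whose first three columns are the sequence name, start and end, followed by one column per statistic; the actual fasta_windows column layout is not described here, so the column names in the usage example are hypothetical, as is the function name `split_statistics`.

```python
import csv
import io
from typing import Dict, List


def split_statistics(tsv_text: str) -> Dict[str, List[str]]:
    """Split a multi-statistics TSV into bedGraph lines keyed by statistic name.

    Assumption: columns 1-3 are sequence name, start, end; every further
    column holds one statistic, named by the header row.
    """
    reader = csv.reader(io.StringIO(tsv_text), delimiter="\t")
    header = next(reader)
    stat_names = header[3:]
    tracks: Dict[str, List[str]] = {name: [] for name in stat_names}
    for row in reader:
        if not row:
            continue  # skip blank lines
        chrom, start, end = row[:3]
        for name, value in zip(stat_names, row[3:]):
            # One four-column bedGraph line per window and statistic.
            tracks[name].append(f"{chrom}\t{start}\t{end}\t{value}")
    return tracks
```

Usage, with made-up column names and values:

```python
table = "seq\tstart\tend\tgc\tskew\nchr1\t0\t1000\t0.41\t-0.02\n"
tracks = split_statistics(table)
print("\n".join(tracks["gc"]))
```

Each list in `tracks` could then be written to its own file before the bgzip and tabix step.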

Planned features:

Add simple repeat finders:
- Low complexity repeats from Dustmasker.
- Inverted repeats from einverted.
- LTR retrotransposons from LTRharvest and LTRdigest.
- Tandem repeats from trf.
- Telomeric repeat annotation (tool to be confirmed).
- Centromeric repeat annotation (tool to be confirmed).
Add comprehensive repeat finders such as EarlGreyTE or EDTA.
Add TRASH (Tandem Repeat Annotation and Structural Hierarchy) pipeline.
Add Pantera pipeline.
Add stainedglass and/or ModDotPlot pipeline.
Mappability tracks.

Variant Calling

sanger-tol/variantcalling calls (short) variants on PacBio data using DeepVariant.

Current features:

Can combine multiple libraries from the same sample.
Optional read mapping subworkflow.
Calls variants using DeepVariant for PacBio long read data.
Speed improvements made by splitting the genome before calling variants.
Outputs both VCF and GVCF formats.
Create bedGraph for distribution of heterozygous sites across genome.
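
To make the heterozygous-site bedGraph concrete, here is a minimal sketch that counts heterozygous genotypes per fixed window from VCF text lines. It is not the pipeline's implementation: the window size, the use of the first sample only, and the function name `het_site_bedgraph` are all assumptions, and window ends are not clipped to contig lengths.

```python
from collections import Counter
from typing import Iterable, List


def het_site_bedgraph(vcf_lines: Iterable[str], window: int = 10000) -> List[str]:
    """Count heterozygous calls of the first sample in fixed-size windows.

    Returns bedGraph lines (0-based start, exclusive end) for windows that
    contain at least one heterozygous site.
    """
    counts: Counter = Counter()
    for line in vcf_lines:
        if line.startswith("#") or not line.strip():
            continue  # skip header and blank lines
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 10:
            continue  # no genotype columns
        chrom, pos = fields[0], int(fields[1])
        keys = fields[8].split(":")
        if "GT" not in keys:
            continue
        genotype = fields[9].split(":")[keys.index("GT")]
        alleles = genotype.replace("|", "/").split("/")
        if "." in alleles or len(set(alleles)) < 2:
            continue  # missing or homozygous call
        # VCF positions are 1-based; convert to a 0-based window index.
        counts[(chrom, (pos - 1) // window)] += 1
    return [
        f"{chrom}\t{index * window}\t{(index + 1) * window}\t{count}"
        for (chrom, index), count in sorted(counts.items())
    ]
```

A production version would more likely read the compressed VCF through a dedicated library and take contig lengths from the genome index.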

Planned features:

Add structural variation detection.

We're considering implementing the following as a separate pipeline called sanger-tol/variantcomposition.

Add calculation for heterozygosity.
Compute runs of homozygosity.
Calculate InDel size distribution.

We would also move the existing "Create bedGraph for distribution of heterozygous sites across genome" feature from sanger-tol/variantcalling into that pipeline.

Final words

Additionally, all pipelines need the following work:

Update the pipeline template.
Update the samplesheet validation steps.
Implement nf-test.

We also need a strategy for converting all outputs to bigBed/bigWig and building public track-hubs.