Introduction

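Genome After Party is a suite of pipelines to standardise the downstream analyses performed on all genomes produced by the Tree of Life. These include:

  • sanger-tol/insdcdownload
  • sanger-tol/ensemblrepeatdownload
  • sanger-tol/ensemblgenedownload
  • sanger-tol/readmapping
  • sanger-tol/genomenote
  • sanger-tol/blobtoolkit
  • sanger-tol/sequencecomposition
  • sanger-tol/variantcalling
  • sanger-tol/variantcomposition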

These pipelines are created using Nextflow DSL2 and the nf-core template. They are designed for portability, scalability and biodiversity. All data generated by the pipelines are available at https://gap.cog.sanger.ac.uk/.

Currently we routinely run readmapping, genomenote, and blobtoolkit on Sanger assemblies (mostly the primary haplotypes, but some alternate haplotypes too). Eventually, we'll be running everything but variantcalling/variantcomposition on all assemblies (even non-Sanger ones), and variantcalling/variantcomposition only on primary haplotypes.

You can see all planned features and requests on the project board. If you have an idea for a new feature – send us your request.

Finally, we have ideas for future pipelines.

INSDC Download

sanger-tol/insdcdownload downloads assemblies from the NCBI.

Current features:

  • Download genome from NCBI as FASTA.
  • Put the unmasked version under assembly/release/ and the masked version under analysis/.
  • Build samtools faidx and dict indices on the genome assemblies.
  • Create BED file with the coordinates of the masked regions.
  • Compress and index the BED file with bgzip and tabix.
  • Prepare a file that can be used to populate BAM headers.

Planned features:

  • Generate a mapping file between accession numbers, chromosome numbers, and sequence names.
  • Generate a SAM header with sequence aliases and other metadata such as species or sample name.

Ensembl Repeat Download

sanger-tol/ensemblrepeatdownload downloads repeat annotations from Ensembl.

Current features:

  • Download the masked FASTA file from Ensembl.
  • Extract the coordinates of the masked regions into a BED file.
  • Compress and index the BED file with bgzip and tabix.
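
The extraction step above can be illustrated with a minimal Python sketch. It assumes the FASTA file is soft-masked (masked bases in lowercase), which is not stated here, and the function name `masked_intervals` is hypothetical. It is not the pipeline's own implementation, only one way to turn lowercase runs into BED-style intervals.

```python
import io
from typing import Iterator, TextIO, Tuple


def masked_intervals(fasta: TextIO) -> Iterator[Tuple[str, int, int]]:
    """Yield (name, start, end) for every run of lowercase bases.

    Coordinates are 0-based and half-open, as BED expects.
    Assumption: masking is encoded as lowercase (soft-masking).
    """
    name = None        # current sequence name
    offset = 0         # bases of the current sequence read so far
    run_start = None   # start of a masked run that is still open

    for raw in fasta:
        line = raw.strip()
        if not line:
            continue
        if line.startswith(">"):
            # Close a run that reaches the end of the previous sequence.
            if run_start is not None:
                yield name, run_start, offset
            name, offset, run_start = line[1:].split()[0], 0, None
            continue
        for i, base in enumerate(line):
            if base.islower():
                if run_start is None:
                    run_start = offset + i
            elif run_start is not None:
                yield name, run_start, offset + i
                run_start = None
        offset += len(line)

    # Close a run that reaches the end of the file.
    if run_start is not None:
        yield name, run_start, offset


# Toy input, not real Ensembl data.
example = io.StringIO(">chr1 demo\nACGTacgt\nacGTAC\n>chr2\nnnACgt\n")
for name, start, end in masked_intervals(example):
    print(f"{name}\t{start}\t{end}")
```

A masked run that spans a line break is reported as a single interval, because the open run is carried across lines until an uppercase base or a new header closes it.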

Planned features:

  • Repeat density sub-workflow.
  • Retrieve repeat annotations, not just coordinates.

Ensembl Gene Download

sanger-tol/ensemblgenedownload downloads gene annotations from Ensembl.

Current features:

  • Download gene annotations from Ensembl in GFF3 format.
  • Download gene sequences from Ensembl in FASTA format.
  • Compress and index all sequence files with bgzip, samtools faidx, and samtools dict.
  • Compress and index the annotation files with bgzip and tabix.

Planned features:

  • Gene density sub-workflow.
  • Sub-tracks for each biotype.

Read Mapping

sanger-tol/readmapping aligns reads generated using Illumina, HiC, PacBio and Nanopore technologies against a genome assembly.

Current features:

  • Align short read data (HiC and Illumina) against the genome with bwamem2.
  • Mark duplicates for short read alignments with samtools.
  • Filter PacBio raw read data using a vector database.
  • Align long read data (ONT, PacBio CCS and PacBio CLR) against the genome with minimap2.
  • Merge all alignment files at the individual level and convert to CRAM format.
  • Calculate statistics for all alignment files using samtools stats, flagstat, and idxstats.
  • Read chunking to speed up alignment for all technologies.
  • Rich metadata in aligned file headers.
  • Support compression with crumble for aligned files.
  • Support multiple output options – BAM, compressed BAM, CRAM, compressed CRAM.

Planned features:

  • Add support for RNAseq data.
  • Replace the "sample" samplesheet parameter with one for the specimen and one for the sequencing run.

Genome Note

sanger-tol/genomenote generates all the data (tables and figures) used in genome note publications. These include (1) assembly information, statistics and chromosome details, (2) PacBio consensus quality and k-mer completeness, and (3) HiC contact maps and mapping statistics.

Current features:

  • Create HiC contact map and chromosomal grid using Cooler.
  • Retrieve assembly information, statistics and chromosome details from NCBI datasets.
  • Compute genome completeness with BUSCO.
  • Compute sequence quality and k-mer completeness with the FastK/MerquryFK suite of tools, using the alternative haplotype.
  • Compute the percentage of HiC primary mappings with samtools flagstat.
  • Broad fetching of genome metadata from numerous sources.
  • Create summary table with the information above.
  • Combine results and metadata with template Word document.
  • Run gfastats.
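
As an illustration of the flagstat step above, here is a small Python sketch that derives a primary-mapping percentage from the text output of samtools flagstat. It assumes a recent samtools release that reports "primary" and "primary mapped" lines, and it assumes the percentage is defined as primary mapped over primary. The pipeline may define it differently, and the function names are hypothetical.

```python
import re

# One flagstat line looks like "<passed> + <failed> <label> [(details)]".
_FLAGSTAT_LINE = re.compile(r"^(\d+) \+ (\d+) (.+)$")


def parse_flagstat(text):
    """Map each flagstat label to its QC-passed read count."""
    counts = {}
    for line in text.splitlines():
        match = _FLAGSTAT_LINE.match(line.strip())
        if match is None:
            continue
        # Drop the trailing "(...)" details, e.g. the percentages.
        label = match.group(3).split(" (")[0]
        counts[label] = int(match.group(1))
    return counts


def percent_primary_mapped(counts):
    """Percentage of primary alignments that are mapped (assumed definition)."""
    primary = counts.get("primary", 0)
    if primary == 0:
        return 0.0
    return 100.0 * counts.get("primary mapped", 0) / primary


# Made-up numbers for illustration only.
EXAMPLE = """\
2000 + 0 in total (QC-passed reads + QC-failed reads)
1900 + 0 primary
60 + 0 secondary
40 + 0 supplementary
0 + 0 duplicates
0 + 0 primary duplicates
1800 + 0 mapped (90.00% : N/A)
1700 + 0 primary mapped (89.47% : N/A)
"""
print(round(percent_primary_mapped(parse_flagstat(EXAMPLE)), 2))
```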

Planned features:

  • Add optional read mapping subworkflow.
  • Generate combined HiC contact maps.
  • Generate smudgeplots.

BlobToolKit

sanger-tol/blobtoolkit is used to identify and analyse non-target DNA for eukaryotic genomes.

Current features:

  • Calculate sequence statistics in 1kb windows for each contig.
  • Count BUSCOs in 1kb windows for each contig using specific and basal lineages.
  • Calculate coverage in 1kb windows using blobtk depth.
  • Aggregate 1kb values into windows of fixed proportion (10%, 1% of contig length) and fixed length (100kb, 1Mb).
  • Diamond blastp search of BUSCO gene models for basal lineages (archaea_odb10, bacteria_odb10 and eukaryota_odb10) against the UniProt reference proteomes.
  • Diamond blastx search of assembly contigs against the UniProt reference proteomes.
  • NCBI blastn search of assembly contigs with no Diamond blastx match against the NCBI nt database.
  • Optional read mapping subworkflow.
  • Import analysis results into a BlobDir dataset.
  • BlobDir validation and static image generation.
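
The aggregation of 1kb values into larger fixed-length windows can be sketched as below. This is only an illustration: it assumes the statistic is a per-base average (so larger windows use a length-weighted mean) and that every small window lies entirely inside one large window. The function name `aggregate_windows` is hypothetical, and the pipeline may aggregate other statistics, such as counts, differently.

```python
def aggregate_windows(windows, length):
    """Merge small windows into fixed-length windows on one contig.

    windows: iterable of (start, end, value), 0-based half-open coordinates.
    length:  size of the larger windows, e.g. 100_000 for 100kb.
    Returns a list of (start, end, length-weighted mean value).
    """
    totals = {}  # large-window index -> [weighted sum, covered bases, end]
    for start, end, value in windows:
        index = start // length
        width = end - start
        acc = totals.setdefault(index, [0.0, 0, 0])
        acc[0] += value * width
        acc[1] += width
        acc[2] = max(acc[2], end)  # the last window of a contig may be shorter
    return [
        (index * length, acc[2], acc[0] / acc[1])
        for index, acc in sorted(totals.items())
    ]


# Toy GC proportions in 1kb windows on a 2.5kb contig, merged into 2kb windows.
small = [(0, 1000, 0.4), (1000, 2000, 0.6), (2000, 2500, 0.5)]
for start, end, mean in aggregate_windows(small, 2000):
    print(start, end, round(mean, 3))
```

For windows of fixed proportion, the same function could be called with a length derived from the contig length, for example 10% of it, rounded to a multiple of the small window size.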

Planned features:

  • Compute read coverage with k-mer based methods.
  • Accept pre-computed read-coverage and fasta_windows analyses.

Sequence Composition

sanger-tol/sequencecomposition extracts statistics from a genome about its sequence composition.

Current features:

  • Run fasta_windows on the genome FASTA file.
  • Extract single-statistics bedGraph files from the multi-statistics outputs.
  • Compress and index all bedGraph and TSV files with bgzip and tabix.

Planned features:

  • Add simple repeat finders:
    • Low complexity repeats from Dustmasker.
    • Inverted repeats from einverted.
    • LTR retrotransposons from LTRharvest and LTRdigest.
    • Tandem repeats from trf.
    • Telomeric repeat annotation (tool to be confirmed).
    • Centromeric repeat annotation (tool to be confirmed).
  • Add comprehensive repeat finders such as EarlGreyTE or EDTA.
  • Add TRASH (Tandem Repeat Annotation and Structural Hierarchy) pipeline.
  • Add Pantera pipeline.
  • Add stainedglass and/or ModDotPlot pipeline.
  • Mappability tracks.

Variant Calling

sanger-tol/variantcalling calls (short) variants on PacBio data using DeepVariant.

Current features:

  • Can combine multiple libraries from the same sample.
  • Optional read mapping subworkflow.
  • Calls variants using DeepVariant for PacBio long read data.
  • Speed improvements made by splitting the genome before calling variants.
  • Outputs both VCF and GVCF formats.
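
To illustrate the idea of splitting the genome before calling variants, here is a minimal Python sketch that turns a samtools faidx index (.fai) into fixed-size regions that could be processed in parallel. How the pipeline actually splits the genome is not described here, so the chunking scheme, the function name `split_genome` and the chunk size are all assumptions.

```python
def split_genome(fai_lines, chunk_size):
    """Build "name:start-end" regions (1-based, inclusive) from .fai lines.

    The first two tab-separated columns of a .fai file are the sequence
    name and its length; the remaining columns are ignored here.
    """
    regions = []
    for line in fai_lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 2:
            continue
        name, length = fields[0], int(fields[1])
        for start in range(1, length + 1, chunk_size):
            end = min(start + chunk_size - 1, length)
            regions.append(f"{name}:{start}-{end}")
    return regions


# Toy index with two sequences, not a real assembly.
fai = ["chr1\t2500\t6\t60\t61\n", "chr2\t800\t2600\t60\t61\n"]
for region in split_genome(fai, 1000):
    print(region)
```

Each region string follows the usual samtools region syntax, so per-region calls can run independently and their outputs can be concatenated in order afterwards.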

Planned features:

  • Add structural variation detection.
  • Remove the option to create a bedGraph of the distribution of heterozygous sites across the genome, as this will be the remit of sanger-tol/variantcomposition.

Variant Composition

sanger-tol/variantcomposition analyses variant calls generated by sanger-tol/variantcalling or elsewhere.

Note: the pipeline only has a pre-release (v0.1.0). It will be considered ready for general use in v0.2.0.

Current features:

  • Statistics.
  • Distribution of heterozygous sites across genome.
  • Runs of homozygosity.
  • InDel size distribution.
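
As an example of one of the analyses above, the InDel size distribution can be sketched in a few lines of Python working on uncompressed VCF text. This is a simplified illustration, not the pipeline's method: the function name `indel_sizes` is hypothetical, sizes are taken as the ALT length minus the REF length, and symbolic or missing alleles (such as `<DEL>`, `<*>`, `*` or `.`) are skipped.

```python
from collections import Counter


def indel_sizes(vcf_lines):
    """Count InDel sizes: positive for insertions, negative for deletions."""
    sizes = Counter()
    for line in vcf_lines:
        if line.startswith("#") or not line.strip():
            continue  # skip header lines and blank lines
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 5:
            continue
        ref = fields[3]
        for alt in fields[4].split(","):
            if not alt.isalpha():
                continue  # symbolic, spanning or missing allele
            size = len(alt) - len(ref)
            if size != 0:  # same length means SNV or MNV, not an InDel
                sizes[size] += 1
    return sizes


# Toy records, not real variant calls.
vcf = [
    "##fileformat=VCFv4.2\n",
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\n",
    "chr1\t10\t.\tA\tAT\t50\tPASS\t.\n",
    "chr1\t20\t.\tACG\tA\t50\tPASS\t.\n",
    "chr1\t30\t.\tG\tT\t50\tPASS\t.\n",
    "chr1\t40\t.\tC\tCA,CAA\t50\tPASS\t.\n",
    "chr1\t50\t.\tT\t<DEL>\t50\tPASS\t.\n",
]
for size, count in sorted(indel_sizes(vcf).items()):
    print(size, count)
```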

Planned features:

  • VCF filtering options.

Future pipelines

  • Pipeline to run BUSCO (outputs can then be used by blobtoolkit and genomenote). A preliminary version is implemented in sanger-tol/busco.
  • Pipeline to download RefSeq annotations.