Introduction
Genome After Party is a suite of pipelines to standardise the downstream analyses performed on all genomes produced by the Tree of Life. These include:
- sanger-tol/insdcdownload downloads assemblies from the NCBI.
- sanger-tol/ensemblrepeatdownload downloads repeat annotations from Ensembl.
- sanger-tol/ensemblgenedownload downloads gene annotations from Ensembl.
- sanger-tol/readmapping aligns reads generated using Illumina, HiC, PacBio and Nanopore technologies against a genome assembly.
- sanger-tol/genomenote creates HiC contact maps and collates (1) assembly information, statistics and chromosome details, (2) PacBio consensus quality and k-mer completeness, and (3) HiC mapping statistics.
- sanger-tol/blobtoolkit is used to identify and analyse non-target DNA for eukaryotic genomes.
- sanger-tol/sequencecomposition extracts statistics from a genome about its sequence composition.
- sanger-tol/variantcalling calls variants using DeepVariant with PacBio data.
- sanger-tol/variantcomposition analyses variant calls generated by sanger-tol/variantcalling or elsewhere.
These pipelines are created using Nextflow DSL2 and the nf-core template. They are designed for portability, scalability and biodiversity. All data generated by the pipelines are available at https://gap.cog.sanger.ac.uk/.
Currently we routinely run readmapping, genomenote, and blobtoolkit on Sanger assemblies (mostly the primary haplotypes, but some alternate haplotypes too). Eventually, we'll be running everything on primary haplotypes and everything but variantcalling/variantcomposition on other haplotypes.
You can see all planned features and requests on the project board. If you have an idea for a new feature – send us your request.
Finally, we have ideas for future pipelines.
INSDC Download
sanger-tol/insdcdownload downloads assemblies from the NCBI.
Current features:
- Download genome from NCBI as Fasta.
- Put the unmasked version under assembly/release/ and the masked version under analysis/.
- Build samtools faidx and dict indices on the genome assemblies.
- Create BED file with the coordinates of the masked region.
- Compress and index the BED file with bgzip and tabix.
- Prepare a file that can be used to populate BAM headers.
- Generate a mapping file between accession numbers, chromosome numbers, and sequence names.
- Generate a SAM header with sequence aliases and other metadata such as species or sample name.
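As an illustration of the masked-region BED step, here is a minimal Python sketch. It assumes the masked assembly is soft-masked, meaning masked bases are lowercase, which the list above does not state. The function names are hypothetical and are not taken from the pipeline.

```python
from typing import Iterator, List, Tuple


def read_fasta(text: str) -> Iterator[Tuple[str, str]]:
    """Yield (sequence name, sequence) pairs from FASTA-formatted text."""
    name, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if line.startswith(">"):
            if name is not None:
                yield name, "".join(chunks)
            # Keep only the first word of the header as the sequence name.
            name, chunks = line[1:].split()[0], []
        elif line:
            chunks.append(line)
    if name is not None:
        yield name, "".join(chunks)


def masked_intervals(sequence: str) -> Iterator[Tuple[int, int]]:
    """Yield 0-based, half-open (start, end) runs of lowercase bases."""
    start = None
    for pos, base in enumerate(sequence):
        if base.islower():
            if start is None:
                start = pos
        elif start is not None:
            yield start, pos
            start = None
    if start is not None:
        yield start, len(sequence)


def masked_bed_lines(fasta_text: str) -> List[str]:
    """Return three-column BED lines for every masked run (assumed soft-masking)."""
    return [
        f"{name}\t{start}\t{end}"
        for name, seq in read_fasta(fasta_text)
        for start, end in masked_intervals(seq)
    ]
```

BED coordinates are 0-based and half-open, so the intervals can be written out as they are. The resulting file would then be compressed with bgzip and indexed with tabix, as listed above.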
Planned features:
- Generate a list of the sequence names, ordered by position on the karyotype.
- Allow downloading from ENA.
Ensembl Repeat Download
sanger-tol/ensemblrepeatdownload downloads repeat annotations from Ensembl.
Current features:
- Download the masked FASTA file from Ensembl.
- Extract the coordinates of the masked regions into a BED file.
- Compress and index the BED file with bgzip and tabix.
Planned features:
- Repeat density sub-workflow.
- Retrieve repeat annotations, not just coordinates.
Ensembl Gene Download
sanger-tol/ensemblgenedownload downloads gene annotations from Ensembl.
Current features:
- Download gene annotations from Ensembl in GFF3 format.
- Download gene sequences from Ensembl in FASTA format.
- Compress and index all sequence files with bgzip, samtools faidx, and samtools dict.
- Compress and index the annotation files with bgzip and tabix.
Planned features:
- Gene density sub-workflow.
- Sub-tracks for each biotype.
Read Mapping
sanger-tol/readmapping aligns reads generated using Illumina, HiC, PacBio and Nanopore technologies against a genome assembly.
Current features:
- Align short read data (HiC and Illumina) against the genome with bwamem2.
- Mark duplicates for short read alignment with samtools.
- Filter PacBio raw read data using vector database.
- Align long read data (ONT, PacBio CCS and PacBio CLR) against the genome with minimap2.
- Merge all alignment files at the individual level and convert to CRAM format.
- Calculate statistics for all alignment files using samtools stats, flagstat, and idxstats.
- Read chunking to speed up alignment for all technologies.
- Rich metadata in aligned file headers.
- Support compression with crumble for aligned files.
- Support multiple output options – BAM, compressed BAM, CRAM, compressed CRAM.
Planned features:
- Add support for PacBio PiMmS data.
- Add support for RNAseq data.
- Add support for ONT data.
- Replace the "sample" samplesheet parameter with one for the specimen and one for the sequencing run.
Genome Note
sanger-tol/genomenote generates all the data (tables and figures) used in genome note publications. These include (1) assembly information, statistics and chromosome details, (2) PacBio consensus quality and k-mer completeness, and (3) HiC contact maps and mapping statistics.
Current features:
- Create HiC contact map and chromosomal grid using Cooler.
- Retrieve assembly information, statistics and chromosome details from NCBI datasets.
- Compute genome completeness with BUSCO.
- Compute sequence quality and k-mer completeness with the FastK/MerquryFK suite of tools, using the alternative haplotype.
- Compute the percentage of HiC primary mappings with samtools flagstat.
- Run GFA stats.
- Broad fetching of genome metadata from numerous sources.
- Create summary table with the information above.
- Combine results and metadata with template Word document.
- Run gfastats.
Planned features:
- Generate combined HiC contact maps
- Generate smudgeplots
Long term plan:
- All inputs should be optional. The pipeline should be able to compute everything, either directly or via sub-pipeline launch.
- Align with the Universal Genome Note platform.
BlobToolKit
sanger-tol/blobtoolkit is used to identify and analyse non-target DNA for eukaryotic genomes.
Current features:
- Calculate sequence statistics in 1kb windows for each contig.
- Count BUSCOs in 1kb windows for each contig using specific and basal lineages.
- Calculate coverage in 1kb windows using blobtk depth.
- Aggregate 1kb values into windows of fixed proportion (10%, 1% of contig length) and fixed length (100kb, 1Mb).
- Diamond blastp search of BUSCO gene models for basal lineages (archaea_odb10, bacteria_odb10 and eukaryota_odb10) against the UniProt reference proteomes.
- Diamond blastx search of assembly contigs against the UniProt reference proteomes.
- NCBI blastn search of assembly contigs with no Diamond blastx match against the NCBI nt database.
- Optional read mapping subworkflow.
- Import analysis results into a BlobDir dataset.
- BlobDir validation and static image generation.
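The window aggregation step can be sketched as follows. This is an illustrative Python example, not the pipeline's code. It assumes each aggregated value is a length-weighted mean of the 1kb values, and that a proportional window is rounded up to a whole number of 1kb windows. Neither assumption is stated in the list above, and all function names are hypothetical.

```python
from typing import List


def window_lengths(contig_length: int, base: int = 1000) -> List[int]:
    """Lengths of consecutive base-sized windows; the last one may be shorter."""
    full, rest = divmod(contig_length, base)
    return [base] * full + ([rest] if rest else [])


def proportional_window(contig_length: int, percent: int, base: int = 1000) -> int:
    """Window length for a percentage of the contig, rounded up to a multiple of base.

    Integer arithmetic avoids floating-point surprises in the rounding (assumed rule).
    """
    n_base_windows = -(-(contig_length * percent) // (100 * base))  # ceiling division
    return max(1, n_base_windows) * base


def aggregate_windows(values: List[float], contig_length: int,
                      window: int, base: int = 1000) -> List[float]:
    """Length-weighted mean of base-window values within each larger window."""
    if window % base:
        raise ValueError("window must be a multiple of the base window size")
    lengths = window_lengths(contig_length, base)
    if len(values) != len(lengths):
        raise ValueError("expected one value per base window")
    step = window // base
    aggregated = []
    for i in range(0, len(values), step):
        chunk_values = values[i:i + step]
        chunk_lengths = lengths[i:i + step]
        weighted_sum = sum(v * n for v, n in zip(chunk_values, chunk_lengths))
        aggregated.append(weighted_sum / sum(chunk_lengths))
    return aggregated
```

A weighted mean suits proportions such as GC content. Counts, such as BUSCO genes per window, would be summed instead.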
Planned features:
- Compute read coverage with k-mer based methods.
Long term plan:
- Accept pre-computed read-coverage and fasta_windows analyses.
Sequence Composition
sanger-tol/sequencecomposition extracts statistics from a genome about its sequence composition.
Current features:
- Run fasta_windows on the genome FASTA file.
- Extract single-statistics bedGraph files from the multi-statistics outputs.
- Compress and index all bedGraph and TSV files with bgzip and tabix.
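To illustrate how single-statistic bedGraph tracks can be extracted from a multi-statistic table, here is a minimal Python sketch. The column names (ID, start, end, GC_prop, N_prop) are assumptions made for the example and may not match the actual fasta_windows output. The function name is hypothetical.

```python
import csv
import io
from typing import Dict, List, Sequence


def split_statistics(tsv_text: str,
                     coord_columns: Sequence[str] = ("ID", "start", "end")
                     ) -> Dict[str, List[str]]:
    """Turn a multi-statistic TSV into one list of bedGraph lines per statistic.

    Every column that is not a coordinate column is treated as a statistic.
    """
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    statistics = [c for c in reader.fieldnames if c not in coord_columns]
    tracks: Dict[str, List[str]] = {name: [] for name in statistics}
    for row in reader:
        coords = "\t".join(row[c] for c in coord_columns)
        for name in statistics:
            # bedGraph line: sequence, start, end, value
            tracks[name].append(f"{coords}\t{row[name]}")
    return tracks
```

Each list can be written to its own file, for example one file per statistic name, before compression and indexing.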
Planned features:
- Add simple repeat finders:
  - Low complexity repeats from Dustmasker.
  - Inverted repeats from einverted.
  - LTR retrotransposons from LTRharvest and LTRdigest.
  - Tandem repeats from trf.
  - Telomeric repeat annotation (tool to be confirmed).
  - Centromeric repeat annotation (tool to be confirmed).
- Add comprehensive repeat finders such as EarlGreyTE or EDTA.
- Add TRASH (Tandem Repeat Annotation and Structural Hierarchy) pipeline.
- Add Pantera pipeline.
- Add stainedglass and/or ModDotPlot pipeline.
- Mappability tracks.
Variant Calling
sanger-tol/variantcalling calls (short) variants on PacBio data using DeepVariant.
Current features:
- Can combine multiple libraries from the same sample.
- Optional read mapping subworkflow.
- Calls variants using DeepVariant for PacBio long read data.
- Speed improvements made by splitting the genome before calling variants.
- Outputs both VCF and GVCF formats.
Planned features:
- Add structural variation detection.
- Remove the ability to create bedGraph for distribution of heterozygous sites across genome as this will be the remit of sanger-tol/variantcomposition.
Variant Composition
sanger-tol/variantcomposition analyses variant calls generated by sanger-tol/variantcalling or elsewhere.
Note: the pipeline only has a pre-release (v0.1.0). It will be considered ready for general use in v0.2.0.
Current features:
- Statistics.
- Distribution of heterozygous sites across genome.
- Runs of homozygosity.
- InDel size distribution.
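Two of these summaries, heterozygous sites per window and the InDel size distribution, can be sketched in plain Python. This is an illustration only, not the pipeline's method. It assumes a single-sample VCF with a GT field and a hypothetical window size of 1kb. Symbolic and spanning-deletion alleles are skipped.

```python
from collections import Counter
from typing import Tuple


def summarise_vcf(vcf_text: str, window: int = 1000) -> Tuple[Counter, Counter]:
    """Count heterozygous sites per window and InDel sizes in a single-sample VCF.

    Returns (het_per_window, indel_sizes). Window keys are (chrom, window_start)
    with a 0-based start. InDel size is len(ALT) - len(REF), so deletions are negative.
    """
    het_per_window: Counter = Counter()
    indel_sizes: Counter = Counter()
    for line in vcf_text.splitlines():
        if not line or line.startswith("#"):
            continue
        fields = line.split("\t")
        chrom, pos, ref = fields[0], int(fields[1]), fields[3]
        alts = fields[4].split(",")
        format_keys = fields[8].split(":")
        sample_values = fields[9].split(":")
        genotype = sample_values[format_keys.index("GT")].replace("|", "/").split("/")
        called = [allele for allele in genotype if allele != "."]
        # Heterozygous: at least two different called alleles.
        if len(set(called)) > 1:
            het_per_window[(chrom, (pos - 1) // window * window)] += 1
        # Record each distinct non-reference allele carried by the sample once.
        for index in sorted({int(a) for a in called if a != "0"}):
            alt = alts[index - 1]
            if alt.startswith("<") or alt == "*":
                continue  # symbolic or spanning-deletion allele: no sequence length
            size = len(alt) - len(ref)
            if size != 0:
                indel_sizes[size] += 1
    return het_per_window, indel_sizes
```

The per-window counts map directly onto a bedGraph track (sequence, window start, window end, count).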
Planned features:
- VCF filtering options.
Future pipelines
These are just ideas at this stage. All to be confirmed.
Download pipelines
- Pipeline to download RefSeq annotations.
  - Combine with the NCBI backend of sanger-tol/insdcdownload and make it sanger-tol/ncbidownload?
- Combine the two Ensembl download pipelines into a single "sanger-tol/ensembldownload" pipeline that could be expanded to other Ensembl data?
BUSCO
Input:
- Assembly (Fasta)
- BUSCO lineage list
- Taxon name / ID (optional, but required for "all" lineage)
- Ancestral linkage groups mapping (optional)
- Chromosome list (optional, for ancestral painting)
Features:
- "all" special lineage resolves to all parent lineages
- "basal" special lineage resolves to all three basal lineages
- Run BUSCO on all selected lineages
- Tidy up the output directories
- Option to keep the individual Fasta files
- Ancestral painting
- Support for large genomes by splitting BUSCO as Nextflow jobs?
- A preliminary version of this is implemented in sanger-tol/busco
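The "all" and "basal" special lineages could be resolved along these lines. This Python sketch is only one possible reading of the idea. The parent map is a small illustrative excerpt, not a complete lineage tree. The taxon is assumed to have already been mapped to its most specific lineage, and "all" is assumed to include that lineage as well as its parents. The function names are hypothetical.

```python
from typing import Dict, List, Optional

# The three basal lineages named in the BlobToolKit section.
BASAL_LINEAGES = ["archaea_odb10", "bacteria_odb10", "eukaryota_odb10"]

# Illustrative child -> parent excerpt (assumption, not a complete tree).
PARENT: Dict[str, str] = {
    "lepidoptera_odb10": "endopterygota_odb10",
    "endopterygota_odb10": "insecta_odb10",
    "insecta_odb10": "arthropoda_odb10",
    "arthropoda_odb10": "metazoa_odb10",
    "metazoa_odb10": "eukaryota_odb10",
}


def ancestry(lineage: str) -> List[str]:
    """Return the lineage followed by each of its parents, most specific first."""
    chain = [lineage]
    while chain[-1] in PARENT:
        chain.append(PARENT[chain[-1]])
    return chain


def resolve_lineages(requested: List[str],
                     most_specific: Optional[str] = None) -> List[str]:
    """Expand the special names 'all' and 'basal', keeping order and dropping repeats."""
    resolved: List[str] = []
    for name in requested:
        if name == "all":
            if most_specific is None:
                raise ValueError("'all' needs the taxon's most specific lineage")
            expanded = ancestry(most_specific)
        elif name == "basal":
            expanded = BASAL_LINEAGES
        else:
            expanded = [name]
        for lineage in expanded:
            if lineage not in resolved:
                resolved.append(lineage)
    return resolved
```

The error raised for "all" without a taxon mirrors the input note that the taxon is optional but required for the "all" lineage.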
Genome statistics
Input:
- List of haploid assemblies
- k-mer library or set of PacBio reads
- Option to skip BUSCO
Features:
- Run the genome_statistics sub-workflow on each assembly, using the other assemblies (concatenated) as the "alternate" assembly
- Combine the QV and completeness scores into a table
- Run genomescope on each assembly
GFF import
Input:
- GFF file
- Chromosome name mapping
Features:
- Rename the sequence names in the GFF file when needed
- Compute the GFF stats
- Generate CDS and Protein Fasta files
- Compute the BUSCO scores
- Compute gene density tracks
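The renaming step could look like the following Python sketch. It rewrites the first column of feature lines and the sequence name in ##sequence-region directives, and leaves names without a mapping unchanged. The function name and the example accession are hypothetical, and an embedded ##FASTA section is not handled.

```python
from typing import Dict


def rename_gff_sequences(gff_text: str, mapping: Dict[str, str]) -> str:
    """Rename sequence identifiers in GFF3 text using a name mapping."""
    renamed = []
    for line in gff_text.splitlines():
        if line.startswith("##sequence-region"):
            # Directive layout: ##sequence-region seqid start end
            parts = line.split()
            if len(parts) >= 2:
                parts[1] = mapping.get(parts[1], parts[1])
            renamed.append(" ".join(parts))
        elif line.startswith("#") or not line.strip():
            # Other directives, comments and blank lines pass through unchanged.
            renamed.append(line)
        else:
            columns = line.split("\t")
            columns[0] = mapping.get(columns[0], columns[0])
            renamed.append("\t".join(columns))
    return "\n".join(renamed) + "\n"
```

For example, a mapping of `{"OX123456.1": "1"}` (a made-up accession) would turn accession-based sequence names into chromosome names.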
Repeats import
Input:
- BED file (or whatever format is convenient / common for repeats)
- Chromosome name mapping
Features:
- Rename the sequence names in the BED file when needed
- Generate a masked Fasta file
- Generate masking statistics
- Compute repeat density tracks
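As a rough illustration of the masking features, the Python sketch below soft-masks a sequence from BED intervals and reports the masked fraction. Soft-masking (lowercasing) is an assumption, since the list does not say whether masking would be soft or hard. The function names are hypothetical.

```python
from typing import Dict, Iterable, List, Tuple


def parse_bed(bed_text: str) -> Dict[str, List[Tuple[int, int]]]:
    """Group 0-based, half-open BED intervals by sequence name."""
    intervals: Dict[str, List[Tuple[int, int]]] = {}
    for line in bed_text.splitlines():
        if not line.strip() or line.startswith(("#", "track", "browser")):
            continue
        name, start, end = line.split("\t")[:3]
        intervals.setdefault(name, []).append((int(start), int(end)))
    return intervals


def soft_mask(sequence: str,
              intervals: Iterable[Tuple[int, int]]) -> Tuple[str, float]:
    """Lowercase the bases covered by the intervals.

    Returns the masked sequence and the fraction of lowercase bases in it.
    Overlapping intervals are counted once because the count is taken from
    the final sequence.
    """
    bases = list(sequence)
    for start, end in intervals:
        # Clip each interval to the sequence bounds.
        for i in range(max(start, 0), min(end, len(bases))):
            bases[i] = bases[i].lower()
    masked = "".join(bases)
    n_masked = sum(1 for base in masked if base.islower())
    fraction = n_masked / len(masked) if masked else 0.0
    return masked, fraction
```

Any bases that are already lowercase in the input stay masked and count towards the fraction.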