Introduction
Genome After Party is a suite of pipelines to standardise the downstream analyses performed on all genomes produced by the Tree of Life. These include:
- sanger-tol/insdcdownload downloads assemblies from the NCBI.
- sanger-tol/ensemblrepeatdownload downloads repeat annotations from Ensembl.
- sanger-tol/ensemblgenedownload downloads gene annotations from Ensembl.
- sanger-tol/readmapping aligns reads generated using Illumina, HiC, PacBio and Nanopore technologies against a genome assembly.
- sanger-tol/genomenote creates HiC contact maps and collates (1) assembly information, statistics and chromosome details, (2) PacBio consensus quality and k-mer completeness, and (3) HiC mapping statistics.
- sanger-tol/blobtoolkit is used to identify and analyse non-target DNA for eukaryotic genomes.
- sanger-tol/sequencecomposition extracts statistics from a genome about its sequence composition.
- sanger-tol/variantcalling calls variants using DeepVariant with PacBio data.
- sanger-tol/variantcomposition analyses variant calls generated by sanger-tol/variantcalling or elsewhere.
These pipelines are created using Nextflow DSL2 and the nf-core template. They are designed for portability, scalability and biodiversity. All data generated by the pipelines are available at https://gap.cog.sanger.ac.uk/.
Currently we routinely run readmapping, genomenote, and blobtoolkit on Sanger assemblies (mostly the primary haplotypes, but some alternate haplotypes too). Eventually, we'll be running everything but variantcalling/variantcomposition on all assemblies (even non-Sanger ones), and variantcalling/variantcomposition only on primary haplotypes.
You can see all planned features and requests on the project board. If you have an idea for a new feature, send us your request.
Finally, we have ideas for future pipelines.
INSDC Download
sanger-tol/insdcdownload downloads assemblies from the NCBI.
Current features:
- Download the genome from NCBI as FASTA.
- Put the unmasked version under assembly/release/ and the masked version under analysis/.
- Build samtools faidx and dict indices on the genome assemblies.
- Create a BED file with the coordinates of the masked regions.
- Compress and index the BED file with bgzip and tabix.
- Prepare a file that can be used to populate BAM headers.
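As an illustration of the BED step above, here is a minimal sketch that turns masked regions into BED intervals. It assumes the masked FASTA is soft-masked (masked bases in lowercase), which the text does not state, and the function names `masked_intervals` and `to_bed` are hypothetical. It is not the pipeline's own implementation.

```python
from typing import Iterable, Iterator, Tuple

Interval = Tuple[str, int, int]


def masked_intervals(name: str, sequence: str) -> Iterator[Interval]:
    """Yield (name, start, end) for each run of lowercase bases.

    Assumption: lowercase means soft-masked. Coordinates follow the BED
    convention (0-based start, exclusive end).
    """
    start = None
    for pos, base in enumerate(sequence):
        if base.islower():
            if start is None:
                start = pos  # a masked run begins here
        elif start is not None:
            yield (name, start, pos)  # the run ended at the previous base
            start = None
    if start is not None:
        # the sequence ends inside a masked run
        yield (name, start, len(sequence))


def to_bed(intervals: Iterable[Interval]) -> str:
    """Format intervals as tab-separated BED lines."""
    return "".join(f"{n}\t{s}\t{e}\n" for n, s, e in intervals)


if __name__ == "__main__":
    print(to_bed(masked_intervals("chr1", "ACGTacgtNNacGT")), end="")
```

The resulting text could then be compressed with bgzip and indexed with tabix, as listed above.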
Planned features:
- Generate a mapping file between accession numbers, chromosome numbers, and sequence names.
- Generate a SAM header with sequence aliases and other metadata such as species or sample name.
Ensembl Repeat Download
sanger-tol/ensemblrepeatdownload downloads repeat annotations from Ensembl.
Current features:
- Download the masked FASTA file from Ensembl.
- Extract the coordinates of the masked regions into a BED file.
- Compress and index the BED file with bgzip and tabix.
Planned features:
- Repeat density sub-workflow.
- Retrieve repeat annotations, not just coordinates.
Ensembl Gene Download
sanger-tol/ensemblgenedownload downloads gene annotations from Ensembl.
Current features:
- Download gene annotations from Ensembl in GFF3 format.
- Download gene sequences from Ensembl in FASTA format.
- Compress and index all sequence files with bgzip, samtools faidx, and samtools dict.
- Compress and index the annotation files with bgzip and tabix.
Planned features:
- Gene density sub-workflow.
- Sub-tracks for each biotype.
Read Mapping
sanger-tol/readmapping aligns reads generated using Illumina, HiC, PacBio and Nanopore technologies against a genome assembly.
Current features:
- Align short read data (HiC and Illumina) against the genome with bwamem2.
- Mark duplicates for short read alignments with samtools.
- Filter PacBio raw read data using a vector database.
- Align long read data (ONT, PacBio CCS and PacBio CLR) against the genome with minimap2.
- Merge all alignment files at the individual level and convert to CRAM format.
- Calculate statistics for all alignment files using samtools stats, flagstat, and idxstats.
- Read chunking to speed up alignment for all technologies.
- Rich metadata in aligned file headers.
- Support compression with crumble for aligned files.
- Support multiple output options: BAM, compressed BAM, CRAM, compressed CRAM.
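The read chunking idea can be sketched as follows. This is only an illustration: it assumes reads arrive as plain four-line FASTQ records, which may not match the inputs the pipeline actually chunks, and `chunk_fastq` is a hypothetical helper.

```python
from itertools import islice
from typing import Iterable, Iterator, List


def chunk_fastq(lines: Iterable[str], reads_per_chunk: int) -> Iterator[List[str]]:
    """Yield lists of FASTQ lines holding at most `reads_per_chunk` records.

    Assumption: every record spans exactly four lines (no wrapped sequences).
    """
    if reads_per_chunk < 1:
        raise ValueError("reads_per_chunk must be at least 1")
    iterator = iter(lines)
    while True:
        chunk = list(islice(iterator, 4 * reads_per_chunk))
        if not chunk:
            return
        yield chunk


if __name__ == "__main__":
    # Three toy records split into chunks of two reads each.
    records = []
    for i in range(3):
        records += [f"@read{i}", "ACGT", "+", "IIII"]
    for number, chunk in enumerate(chunk_fastq(records, 2)):
        print(number, len(chunk) // 4, "reads")
```

Each chunk can be aligned independently and the resulting alignments merged afterwards, which is what makes chunking a speed-up when many jobs run in parallel.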
Planned features:
- Add support for RNAseq data.
- Replace the "sample" samplesheet parameter with one for the specimen and one for the sequencing run.
Genome Note
sanger-tol/genomenote generates all the data (tables and figures) used in genome note publications. These include (1) assembly information, statistics and chromosome details, (2) PacBio consensus quality and k-mer completeness, and (3) HiC contact maps and mapping statistics.
Current features:
- Create HiC contact map and chromosomal grid using Cooler.
- Retrieve assembly information, statistics and chromosome details from NCBI datasets.
- Compute genome completeness with BUSCO.
- Compute sequence quality and k-mer completeness with the FastK/MerquryFK suite of tools, using the alternative haplotype.
- Compute the percentage of HiC primary mappings with samtools flagstat.
- Run gfastats.
- Broad fetching of genome metadata from numerous sources.
- Create summary table with the information above.
- Combine results and metadata with template Word document.
Planned features:
- Add optional read mapping subworkflow.
- Generate combined HiC contact maps.
- Generate smudgeplots.
BlobToolKit
Cooler.datasets.BUSCO.FastK/MerquryFK suite of tools, using the alternative haplotype.samtools flagstat.Planned features:
- Add optional read mapping subworkflow.
- Generate combined HiC contact maps
- Generate smudgeplots
BlobToolKit
sanger-tol/blobtoolkit is used to identify and analyse non-target DNA for eukaryotic genomes.
Current features:
- Calculate sequence statistics in 1kb windows for each contig.
- Count BUSCOs in 1kb windows for each contig using specific and basal lineages.
- Calculate coverage in 1kb windows using blobtk depth.
- Aggregate 1kb values into windows of fixed proportion (10%, 1% of contig length) and fixed length (100kb, 1Mb).
- Diamond blastp search of BUSCO gene models for basal lineages (archaea_odb10, bacteria_odb10 and eukaryota_odb10) against the UniProt reference proteomes.
- Diamond blastx search of assembly contigs against the UniProt reference proteomes.
- NCBI blastn search of assembly contigs with no Diamond blastx match against the NCBI nt database.
- Optional read mapping subworkflow.
- Import analysis results into a BlobDir dataset.
- BlobDir validation and static image generation.
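The windowing steps above (per-window statistics followed by aggregation into larger windows) can be sketched as below. GC fraction stands in for the sequence statistics, and the helpers `gc_per_window` and `aggregate` are hypothetical. Averaging window values is a simplification that slightly misweights a shorter final window, and the pipeline may aggregate differently.

```python
from typing import List


def gc_per_window(sequence: str, window: int = 1000) -> List[float]:
    """GC fraction of each consecutive window; the last one may be shorter."""
    values = []
    for start in range(0, len(sequence), window):
        chunk = sequence[start:start + window].upper()
        gc = chunk.count("G") + chunk.count("C")
        values.append(gc / len(chunk))
    return values


def aggregate(values: List[float], factor: int) -> List[float]:
    """Average every `factor` consecutive base windows into one larger window."""
    if factor < 1:
        raise ValueError("factor must be at least 1")
    result = []
    for i in range(0, len(values), factor):
        group = values[i:i + factor]
        result.append(sum(group) / len(group))
    return result


if __name__ == "__main__":
    toy_contig = "GGCC" + "AATT" + "GCAT"
    base = gc_per_window(toy_contig, window=4)  # toy 4 bp windows
    print(base)
    print(aggregate(base, 2))
```

With 1kb base windows, a factor of 100 would give 100kb windows. A fixed-proportion window would instead derive the factor from the number of base windows in the contig.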
Planned features:
- Compute read coverage with k-mer based methods.
- Accept pre-computed read-coverage and fasta_windows analyses.
Sequence Composition
sanger-tol/sequencecomposition extracts statistics from a genome about its sequence composition.
Current features:
- Run fasta_windows on the genome FASTA file.
- Extract single-statistic bedGraph files from the multi-statistics outputs.
- Compress and index all bedGraph and TSV files with bgzip and tabix.
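Extracting a single-statistic bedGraph from a multi-statistic table could look like the sketch below. The column names (`ID`, `start`, `end`, `GC_prop`) and the function `extract_bedgraph` are assumptions made for the example. Check the actual header of the TSV before relying on them.

```python
import csv
import io


def extract_bedgraph(tsv_text: str, column: str) -> str:
    """Return bedGraph text (sequence, start, end, value) for one statistic.

    Assumption: the table is tab-separated with a header row that contains
    'ID', 'start', 'end' and the requested statistic column.
    """
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    lines = []
    for row in reader:
        lines.append(f"{row['ID']}\t{row['start']}\t{row['end']}\t{row[column]}\n")
    return "".join(lines)


if __name__ == "__main__":
    table = (
        "ID\tstart\tend\tGC_prop\tGC_skew\n"
        "chr1\t0\t1000\t0.41\t0.02\n"
        "chr1\t1000\t2000\t0.39\t-0.01\n"
    )
    print(extract_bedgraph(table, "GC_prop"), end="")
```

One such file per statistic keeps each track small and directly loadable in a genome browser once it is compressed with bgzip and indexed with tabix.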
Planned features:
- Add simple repeat finders:
  - Low complexity repeats from Dustmasker.
  - Inverted repeats from einverted.
  - LTR retrotransposons from LTRharvest and LTRdigest.
  - Tandem repeats from trf.
  - Telomeric repeat annotation (tool to be confirmed).
  - Centromeric repeat annotation (tool to be confirmed).
- Add comprehensive repeat finders such as EarlGreyTE or EDTA.
- Add TRASH (Tandem Repeat Annotation and Structural Hierarchy) pipeline.
- Add Pantera pipeline.
- Add stainedglass and/or ModDotPlot pipeline.
- Mappability tracks.
Variant Calling
sanger-tol/variantcalling calls (short) variants on PacBio data using DeepVariant.
Current features:
- Can combine multiple libraries from the same sample.
- Optional read mapping subworkflow.
- Calls variants using DeepVariant for PacBio long read data.
- Speed improvements made by splitting the genome before calling variants.
- Outputs both VCF and GVCF formats.
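Splitting the genome before calling variants can be pictured with the sketch below, which cuts each sequence into fixed-size regions that could be processed in parallel. The chunk size, the `split_regions` helper and the 1-based inclusive `name:start-end` region strings are assumptions for illustration. The pipeline's actual splitting strategy may differ.

```python
from typing import Dict, List


def split_regions(lengths: Dict[str, int], chunk: int) -> List[str]:
    """Split each sequence into regions of at most `chunk` bases.

    `lengths` maps sequence names to lengths, as found in the first two
    columns of a samtools faidx index. Regions are 1-based and inclusive.
    """
    if chunk < 1:
        raise ValueError("chunk must be at least 1")
    regions = []
    for name, length in lengths.items():
        for start in range(0, length, chunk):
            end = min(start + chunk, length)
            regions.append(f"{name}:{start + 1}-{end}")
    return regions


if __name__ == "__main__":
    for region in split_regions({"chr1": 2500, "chr2": 1000}, 1000):
        print(region)
```

Per-region call sets would then need to be concatenated in genome order to produce the final VCF and GVCF files.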
Planned features:
- Add structural variation detection.
- Remove the ability to create bedGraph for distribution of heterozygous sites across genome as this will be the remit of sanger-tol/variantcomposition.
Variant Composition
sanger-tol/variantcomposition analyses variant calls generated by sanger-tol/variantcalling or elsewhere.
Note: the pipeline only has a pre-release (v0.1.0). It will be considered ready for general use in v0.2.0.
Current features:
- Statistics.
- Distribution of heterozygous sites across genome.
- Runs of homozygosity.
- InDel size distribution.
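As an example of one of these analyses, an InDel size distribution can be derived from the REF and ALT columns of a VCF. The sketch below is a simplified illustration, not the pipeline's method: `indel_sizes` is a hypothetical helper that skips symbolic alleles, breakends and missing alleles, and it treats the length difference between ALT and REF as the InDel size.

```python
from collections import Counter
from typing import Iterable


def indel_sizes(vcf_lines: Iterable[str]) -> Counter:
    """Count InDels by size: positive for insertions, negative for deletions."""
    sizes = Counter()
    for line in vcf_lines:
        if line.startswith("#") or not line.strip():
            continue  # skip header and blank lines
        fields = line.rstrip("\n").split("\t")
        ref, alts = fields[3], fields[4].split(",")
        for alt in alts:
            # Skip missing, overlapping-deletion, symbolic and breakend alleles.
            if alt in (".", "*") or alt.startswith("<") or "[" in alt or "]" in alt:
                continue
            delta = len(alt) - len(ref)
            if delta != 0:
                sizes[delta] += 1
    return sizes


if __name__ == "__main__":
    toy_vcf = [
        "##fileformat=VCFv4.2",
        "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO",
        "chr1\t10\t.\tA\tAT\t50\tPASS\t.",
        "chr1\t20\t.\tATG\tA\t50\tPASS\t.",
        "chr1\t30\t.\tA\tG\t50\tPASS\t.",
        "chr1\t40\t.\tAT\tA,ATT\t50\tPASS\t.",
    ]
    for size, count in sorted(indel_sizes(toy_vcf).items()):
        print(size, count)
```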
Planned features:
- VCF filtering options.
Future pipelines
- Pipeline to run BUSCO (outputs can then be used by blobtoolkit and genomenote). A preliminary version is implemented in sanger-tol/busco.
- Pipeline to download RefSeq annotations.