The Tree of Life projects will generate tens of thousands of high-quality genomes – more than have ever been sequenced! It is a challenging and extremely exciting task that will shape the future of biology, and the team’s role is to provide the platform for assembling and analysing those genomes at an unprecedented scale. We are the interface between the Tree of Life teams (assembly production and faculty research) and Sanger IT, working together with the informatics teams of the other programmes.
The team is organised in three poles.
📂 Data management: Our data curators and managers maintain the integrity, consistency, and quality of multiple databases used in production, including Genomes on a Tree (GoaT), Sample Tracking System (STS), Collaborative Open Plant Omics (COPO), and BioSamples.
💻 Bioinformatics: Our bioinformaticians develop the suite of analysis pipelines that will run on every genome produced in Tree of Life, providing a central database of core results available for all.
🔩 Systems: We develop and maintain some core systems used in production, including the execution and tracking of all bioinformatics pipelines, and the deployment of third-party web applications for internal use.
The team uses a wide range of technologies, frameworks and programming languages, including Nextflow, Python, Conda, Jira, LSF, Singularity, and Kubernetes. The technology wheel below shows most of their logos. How many can you recognise?
Genome After Party
Genome After Party is a suite of pipelines to standardise the downstream analyses performed on all genomes produced by the Tree of Life. These include:
- sanger-tol/insdcdownload downloads assemblies from INSDC into a Tree of Life directory structure.
- sanger-tol/ensemblrepeatdownload downloads repeat annotations from Ensembl into a Tree of Life directory structure.
- sanger-tol/ensemblgenedownload downloads gene annotations from Ensembl into a Tree of Life directory structure.
- sanger-tol/sequencecomposition extracts statistics from a genome about its sequence composition.
- sanger-tol/readmapping aligns reads generated using Illumina, HiC, PacBio and Nanopore technologies against a genome assembly.
- sanger-tol/variantcalling calls variants using DeepVariant with PacBio data.
- sanger-tol/blobtoolkit identifies and analyses non-target DNA in eukaryotic genomes.
- sanger-tol/genomenote creates HiC contact maps and collates (1) assembly information, statistics and chromosome details, (2) PacBio consensus quality and k-mer completeness, and (3) HiC mapping statistics.
A portal is being developed to automate the production of genome note publications. It will execute the Nextflow pipeline and populate an associated database with the generated statistics and images. The platform is being designed in collaboration with the Enabling Platforms team to create genome note style publications for both internal Tree of Life assemblies and external genome assemblies.
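To give a flavour of the kind of statistic a pipeline such as sanger-tol/sequencecomposition reports, here is a minimal Python sketch that computes base counts and GC fraction from FASTA text. It is an illustration only and not the pipeline's actual implementation: the function name `base_composition`, the output keys, and the toy sequence are all assumptions made for this example.

```python
from collections import Counter


def base_composition(fasta_text: str) -> dict:
    """Return base counts, total length and GC fraction for FASTA text.

    Hypothetical helper for illustration; not part of any sanger-tol pipeline.
    """
    counts = Counter()
    for line in fasta_text.splitlines():
        line = line.strip()
        # Skip blank lines and FASTA header lines (those starting with ">").
        if not line or line.startswith(">"):
            continue
        counts.update(line.upper())

    length = sum(counts.values())
    # GC fraction is computed over unambiguous bases only, so N is excluded.
    acgt = sum(counts[base] for base in "ACGT")
    gc = counts["G"] + counts["C"]
    return {
        "length": length,
        "counts": dict(counts),
        "n_count": counts["N"],
        "gc_fraction": gc / acgt if acgt else 0.0,
    }


# Toy input: one record with 10 bases, 2 of which are N.
example = ">chr1\nACGTNN\nGGCC\n"
stats = base_composition(example)
print(stats["length"], stats["n_count"], stats["gc_fraction"])
```

A production pipeline would stream large, compressed genome files and report many more metrics; the sketch only shows the basic idea of summarising sequence composition.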
If you have an idea for a new feature – send us your request.
Matthieu Muffato, Team Lead
Matthieu leads the Informatics Infrastructure team, which guides the implementation and delivery of the genome assembly pipelines, and provides support for large-scale genome analyses for the Tree of Life faculty teams. He joined the Wellcome Sanger Institute in February 2021, to form the Informatics Infrastructure team for the Tree of Life programme. He has recruited 7 team members, with skills covering data curation & management, software development & operations, and bioinformatics.
Guoying Qi, DevOps Software Developer
Guoying, a DevOps software engineer, is responsible for developing and deploying software and web applications for the Tree of Life project across various platforms such as computing farms, Kubernetes, OpenStack, and public clouds.
Priyanka Surana, Senior Bioinformatician
Priyanka is a Senior Bioinformatician, overseeing the development of Nextflow pipelines for genome assembly, curation and downstream analyses. She also facilitates the workflows community and is passionate about building networks that support peer learning.
Cibin Sadasivan Baby, Senior Software Developer
Cibin, a Senior Software Developer, is tasked with designing and implementing the production systems for TOL-IT. Currently, Cibin is focused on building an automated platform to execute high-throughput genomic pipelines. The ultimate goal of this project is to develop a system capable of efficiently processing large amounts of genomic data.
Cibele Sotero-Caio, Genomic Data Curator
Cibele is the data curator for Genomes on a Tree (GoaT), a platform developed to support the Tree of Life and other sequencing initiatives of the Earth BioGenome Project (EBP).
Paul Davis, Data Manager
Paul works on the main ToL Genome Engine. This system was developed by ToL to manage and track samples from collection, onboarding and processing in the lab, through sequencing, to the publication of the assembly and its Genome Note. As there are many steps in this process, developing methodology to identify issues as early as possible is vital to avoid wasting time and resources. Paul works at all levels of the project, fielding questions about data flow and data fixes, and helps other ToL staff and project stakeholders with data and information. Paul also interacts with external groups and stakeholders to maintain data integrity in the public domain.
Beth Yates, Bioinformatics Engineer
Beth is a Bioinformatics Engineer working on building a platform to automate the production of Genome Note publications. The Universal Genome Note platform consists of a web portal, database and Nextflow pipelines. Beth is contributing to the genomenote pipeline, which fetches assembly metadata and generates some of the figures and statistics included in each genome note.
Zaynab Butt, Informatics and Digital Associate