Introduction

Workflow input

Parameters summary

The workflow accepts the following parameters:

  • input - (required) YAML file containing description of the dataset, incl. ToLID, paths to the raw data etc.
  • bed_chunks_polishing - number of chunks to split the contigs into for polishing (default 100)
  • cool_bin - bin size for cooler (default 1000)
  • organelles_on - set True to run the organelles subworkflow
  • polishing_on - set True to run polishing
  • hifiasm_hic_on - set True to run hifiasm in HiC mode
    NB: hifiasm in the original mode is used as the main assembly even if the hifiasm_hic_on flag is set
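
As a rough illustration of how these parameters combine on the command line, the sketch below assembles a launch command that enables polishing and the organelles subworkflow. The file name dataset.yaml, the output directory results and the choice of the singularity profile are placeholders, not values required by the pipeline. The command is only printed so that it can be reviewed before launching.

```shell
# Placeholder paths: adjust to your own dataset description and output folder.
INPUT=dataset.yaml
OUTDIR=results

# Pipeline parameters use a double hyphen; Nextflow options such as -profile use one.
# The two numeric values are the documented defaults, shown only for illustration.
cmd="nextflow run sanger-tol/genomeassembly \
--input $INPUT --outdir $OUTDIR \
--polishing_on true --organelles_on true \
--bed_chunks_polishing 100 --cool_bin 1000 \
-profile singularity"

# Print the command for review; launch it with: eval "$cmd"
echo "$cmd"
```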

Full samplesheet

The input dataset is described in YAML format, which stands for "YAML Ain't Markup Language" (originally "Yet Another Markup Language"). It is a human-readable file that contains the paths to the raw data (HiFi, 10X, HiC) used for the genome assembly. It can also contain meta information such as HiC restriction motifs, BUSCO lineage, mitochondrial code etc. For more information see Input YAML definition.

Input YAML definition

  • dataset.id
    • is used as the sample id throughout the pipeline. ToLID should be used in ToL datasets.

  • dataset.illumina_10X.reads
    • is required only if polishing is applied. This field should point to the folder containing the 10X reads. The sample identifier in the Illumina read names should match the top-level ID. For use with the Longranger software, the reads should follow the 10X FASTQ file naming convention.

  • dataset.pacbio.reads
    • contains the list (-reads) of the HiFi reads in FASTA (or gzipped FASTA) format. The pipeline assumes that the reads have already gone through adapter/barcode checks.

  • dataset.HiC.reads
    • contains the list (-reads) of the HiC reads in the indexed CRAM format.

  • dataset.hic_motif
    • is a comma-separated list of restriction sites. The pipeline was tested with the Arima dataset, but it should be fine to use it with other HiC libraries.

  • dataset.busco.lineage
    • specifies the name of the BUSCO dataset (e.g. bacteria_odb10).

  • dataset.busco.lineage_path
    • is an optional field containing the path to the folder with pre-downloaded BUSCO lineages.

  • dataset.mito.species
    • is the Latin name of the species used to look up the mitogenome reference in the organelles subworkflow. Normally this parameter will contain the Latin name of the species whose genome is being assembled.

  • dataset.mito.min_length
    • sets the minimum length of the mitogenome in base pairs, e.g. 15000 (15 kb).

  • dataset.mito.code
    • is a mitochondrial code for the mitogenome annotation. See here for reference.

An example of the input YAML

Example is based on test.yaml.

dataset:
  id: baUndUnlc1
  illumina_10X:
    reads:
      - https://tolit.cog.sanger.ac.uk/test-data/Undibacterium_unclassified/genomic_data/baUndUnlc1/10x/baUndUnlc1_S12_L002_R1_001.fastq.gz
      - https://tolit.cog.sanger.ac.uk/test-data/Undibacterium_unclassified/genomic_data/baUndUnlc1/10x/baUndUnlc1_S12_L002_R2_001.fastq.gz
      - https://tolit.cog.sanger.ac.uk/test-data/Undibacterium_unclassified/genomic_data/baUndUnlc1/10x/baUndUnlc1_S12_L002_I1_001.fastq.gz
  pacbio:
    reads:
      - reads: https://tolit.cog.sanger.ac.uk/test-data/Undibacterium_unclassified/genomic_data/baUndUnlc1/pacbio/fasta/HiFi.reads.fasta
  HiC:
    reads:
      - reads: https://tolit.cog.sanger.ac.uk/test-data/Undibacterium_unclassified/genomic_data/baUndUnlc1/hic-arima2/41741_2%237.sub.cram
hic_motif: GATC,GANTC,CTNAG,TTAA
hic_aligner: bwamem2
busco:
  lineage: bacteria_odb10
mito:
  species: Caradrina clavipalpis
  min_length: 15000
  code: 5
  fam: https://github.com/c-zhou/OatkDB/raw/main/v20230921/insecta_mito.fam
plastid:
  fam: https://github.com/c-zhou/OatkDB/raw/main/v20230921/acrogymnospermae_pltd.fam

Extra installation procedures

Longranger

Longranger is a proprietary software product from 10X Genomics. Its terms and conditions state that we cannot redistribute the copy we have in the Tree of Life department.

If you want to run the polishing option, you have to install longranger yourself. Go to https://support.10xgenomics.com/genome-exome/software/downloads/latest, read their End User Software License Agreement, and you'll be able to download the software if you accept it.

To make a Docker (or Singularity) container out of it, use the following Dockerfile.

FROM ubuntu:22.04
LABEL org.opencontainers.image.licenses="10x Genomics End User Software License Agreement - https://support.10xgenomics.com/genome-exome/software/downloads/latest"
ARG DEST=/opt
ADD ./longranger-2.2.2.tar.gz $DEST
RUN ln -s $DEST/longranger-2.2.2/longranger /usr/local/bin/
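
Building the image could look like the sketch below, assuming the Dockerfile above and the downloaded longranger-2.2.2.tar.gz sit in the current directory. The image tag longranger:2.2.2 is a hypothetical name; any tag works as long as the container setting in your config refers to the same image.

```shell
TARBALL=longranger-2.2.2.tar.gz
IMAGE=longranger:2.2.2   # hypothetical tag, pick your own

if [ -f "$TARBALL" ] && command -v docker >/dev/null 2>&1; then
    # ADD in the Dockerfile unpacks the local tarball into /opt inside the image.
    if docker build -t "$IMAGE" .; then
        status=built
    else
        status=failed
    fi
else
    status=skipped
    echo "Skipping build: need $TARBALL and a working docker installation." >&2
fi
echo "$status"
```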

Then, to use the container in the pipeline, write the following to a longranger.config file

process {
    withName: LONGRANGER_MKREF {
        container = "/path/to/longranger_container"
    }

    withName: LONGRANGER_ALIGN {
        container = "/path/to/longranger_container"
    }
}

And pass it to the pipeline with -c longranger.config.

Usage

Local testing

The pipeline can be tested locally using a provided small test dataset:

git clone git@github.com:sanger-tol/genomeassembly.git
cd genomeassembly/
nextflow run main.nf -profile test,singularity --outdir ${OUTDIR} {OTHER ARGUMENTS}

These command line steps will download the pipeline and run the test.

You should now be able to run the pipeline as you see fit.

Running the pipeline

The typical command for running the pipeline is as follows:

nextflow run sanger-tol/genomeassembly --input assets/dataset.yaml --outdir <OUTDIR> -profile docker,sanger

This will launch the pipeline with the docker configuration profile, also using your institution profile if available (see nf-core/configs). See below for more information about profiles.

If the organelles subworkflow is switched on, you will also need to set a Nextflow secret to store the API key belonging to your user:

  nextflow secrets set TOL_API_KEY '[API key]'

Note that the pipeline will create the following files in your working directory:

work                # Directory containing the nextflow working files
<OUTDIR>            # Finished results in specified location (defined with --outdir)
.nextflow.log       # Log file from Nextflow
# Other nextflow hidden files, eg. history of pipeline runs and old logs.

Updating the pipeline

When you run the above command, Nextflow automatically pulls the pipeline code from GitHub and stores it as a cached version. When running the pipeline after this, it will always use the cached version if available, even if the pipeline has been updated since. To make sure that you're running the latest version of the pipeline, regularly update the cached version:

nextflow pull sanger-tol/genomeassembly

Reproducibility

It is a good idea to specify a pipeline version when running the pipeline on your data. This ensures that a specific version of the pipeline code and software are used when you run your pipeline. If you keep using the same tag, you'll be running the same version of the pipeline, even if there have been changes to the code since.

First, go to the sanger-tol/genomeassembly releases page and find the latest version number - numeric only (eg. 1.3.1). Then specify this when running the pipeline with -r (one hyphen) - eg. -r 1.3.1.
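
A pinned launch might look like the following sketch. The version number is the example from the text above, not necessarily the latest release, and the input and output paths are placeholders.

```shell
VERSION=1.3.1   # example value only; check the releases page for the current number

# -r takes a single hyphen because it is a Nextflow option, not a pipeline parameter.
cmd="nextflow run sanger-tol/genomeassembly -r $VERSION --input assets/dataset.yaml --outdir results -profile docker"

# Print the command for review; launch it with: eval "$cmd"
echo "$cmd"
```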

This version number will be logged in reports when you run the pipeline, so that you'll know what you used when you look back in the future.

Core Nextflow arguments

NB: These options are part of Nextflow and use a single hyphen (pipeline parameters use a double-hyphen).

-profile

Use this parameter to choose a configuration profile. Profiles can give configuration presets for different compute environments.

Several generic profiles are bundled with the pipeline which instruct the pipeline to use software packaged using different methods (Docker, Singularity, Podman, Shifter, Charliecloud, Conda) - see below. When using Biocontainers, most of these software packaging methods pull Docker containers from quay.io (e.g. FastQC), except for Singularity, which directly downloads Singularity images via HTTPS hosted by the Galaxy project, and Conda, which downloads and installs software locally from Bioconda.

We highly recommend the use of Docker or Singularity containers for full pipeline reproducibility. When this is not possible, Conda is also supported.

The pipeline also dynamically loads configurations from https://github.com/nf-core/configs when it runs, making multiple config profiles for various institutional clusters available at run time. For more information and to see if your system is available in these configs please see the nf-core/configs documentation.

Note that multiple profiles can be loaded, for example: -profile test,docker - the order of arguments is important! They are loaded in sequence, so later profiles can overwrite earlier profiles.

If -profile is not specified, the pipeline will run locally and expect all software to be installed and available on the PATH. This is not recommended.

  • docker
    • A generic configuration profile to be used with Docker
  • singularity
    • A generic configuration profile to be used with Singularity
  • podman
    • A generic configuration profile to be used with Podman
  • shifter
    • A generic configuration profile to be used with Shifter
  • charliecloud
    • A generic configuration profile to be used with Charliecloud
  • conda
    • A generic configuration profile to be used with Conda. Please only use Conda as a last resort i.e. when it's not possible to run the pipeline with Docker, Singularity, Podman, Shifter or Charliecloud.
  • test
    • A profile with a complete configuration for automated testing
    • Includes links to test data so needs no other parameters

-resume

Specify this when restarting a pipeline. Nextflow will use cached results from any pipeline steps where the inputs are the same, continuing from where it got to previously. For input to be considered the same, not only the names must be identical but the files' contents as well. For more info about this parameter, see this blog post.

You can also supply a run name to resume a specific run: -resume [run-name]. Use the nextflow log command to show previous run names.
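
The two ways of resuming can be sketched as follows. RUN_NAME is a hypothetical shell variable introduced for this example: leave it empty to resume the most recent run, or set it to one of the names listed by nextflow log. The input and output paths are placeholders.

```shell
# Empty by default, which resumes the most recent run.
RUN_NAME="${RUN_NAME:-}"

cmd="nextflow run sanger-tol/genomeassembly --input assets/dataset.yaml --outdir results -profile docker -resume"

# A run name, when given, must follow -resume directly.
if [ -n "$RUN_NAME" ]; then
    cmd="$cmd $RUN_NAME"
fi

# Print the command for review; launch it with: eval "$cmd"
echo "$cmd"
```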

-c

Specify the path to a specific config file (this is a core Nextflow command). See the nf-core website documentation for more information.

Custom configuration

Resource requests

Whilst the default requirements set within the pipeline will hopefully work for most people and with most input data, you may find that you want to customise the compute resources that the pipeline requests. Each step in the pipeline has a default set of requirements for number of CPUs, memory and time. For most of the steps in the pipeline, if the job exits with any of the error codes specified here it will automatically be resubmitted with higher requests (2 x original, then 3 x original). If it still fails after the third attempt then the pipeline execution is stopped.

To change the resource requests, please see the max resources and tuning workflow resources section of the nf-core website.
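
As a minimal sketch of such a customisation, the snippet below writes a small config file that overrides the requests for a single process and prints a command that passes it with -c. The file name custom_resources.config and the resource values are illustrative assumptions; LONGRANGER_ALIGN is used only because it is a process name mentioned earlier in this document.

```shell
# Write a Nextflow config that overrides resources for one process.
cat > custom_resources.config <<'EOF'
process {
    withName: LONGRANGER_ALIGN {
        cpus   = 16
        memory = 64.GB
        time   = 24.h
    }
}
EOF

# Pass the file with -c (single hyphen, a core Nextflow option).
cmd="nextflow run sanger-tol/genomeassembly --input assets/dataset.yaml --outdir results -profile docker -c custom_resources.config"
echo "$cmd"
```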

nf-core/configs

In most cases, you will only need to create a custom config as a one-off. However, if you and others within your organisation are likely to run nf-core pipelines regularly with the same settings, it may be a good idea to request that your custom config file is uploaded to the nf-core/configs git repository. Before you do this, please test that the config file works with your pipeline of choice using the -c parameter. You can then create a pull request to the nf-core/configs repository adding your config file and the associated documentation file (see examples in nf-core/configs/docs), and amending nfcore_custom.config to include your custom profile.

See the main Nextflow documentation for more information about creating your own configuration files.

If you have any questions or issues please send us a message on Slack on the #configs channel.

Running in the background

Nextflow handles job submissions and supervises the running jobs. The Nextflow process must run until the pipeline is finished.

The Nextflow -bg flag launches Nextflow in the background, detached from your terminal so that the workflow does not stop if you log out of your session. The logs are saved to a file.

Alternatively, you can use screen / tmux or a similar tool to create a detached session which you can log back into at a later time. Some HPC setups also allow you to run Nextflow within a cluster job submitted to your job scheduler (from where it submits more jobs).

Nextflow memory requirements

In some cases, the Nextflow Java virtual machines can start to request a large amount of memory. We recommend adding the following line to your environment to limit this (typically in ~/.bashrc or ~/.bash_profile):

export NXF_OPTS='-Xms1g -Xmx4g'