Introduction
The TreeVal pipeline has a few requirements before it can run:
- The gene_alignment_data (this refers to the alignment.data_dir and alignment.geneset data noted in the yaml, which we will explain later) and the synteny data must follow a particular directory structure.
- HiC CRAM files must already be pre-indexed in the same location as the CRAM file, e.g., samtools index {cram file}. If it would be more helpful to the community for this to be an automated process, please submit an issue.
- Finally, the yaml file, which is described below in Full Samplesheet. This needs to contain all of the information related to the assembly for the pipeline to run.
Prior to running TreeVal
Please ensure you read the following sections on Directory Structure (gene_alignment_data, synteny, scripts), HiC data prep and PacBio data prep. Without these you may not be able to successfully run the TreeVal pipeline. If anything is unclear then leave an issue report.
Local testing
Details
We provide a complete set of test databases that can be used to test the pipeline locally.
First, choose a download location ${TREEVAL_TEST_DATA} and run this command:
cd ${TREEVAL_TEST_DATA}
curl https://tolit.cog.sanger.ac.uk/test-data/resources/treeval/TreeValTinyData.tar.gz | tar xzf -
sed -i "s|/home/runner/work/treeval/treeval|${TREEVAL_TEST_DATA}|" TreeValTinyData/gene_alignment_data/fungi/csv_data/LaetiporusSulphureus.gfLaeSulp1-data.csv
Then, modify the configuration file to point at that download location and off you go:
sed -i "s|/home/runner/work/treeval/treeval|${TREEVAL_TEST_DATA}|" assets/github_testing/TreeValTinyTest.yaml
nextflow run . -profile test_github,singularity
Directory Structure
Details
A working example can be found below in the Gene Alignment and Synteny section; it covers setting up the synteny and gene_alignment_data directories as well as downloading some example data.
These two sub-workflows, for now, require the variables classT, synteny_genome_path, data_dir and geneset. These variables are found inside the yaml (the file that tells TreeVal what everything is and where to find it). Currently, we don't use common_name, e.g., bee, wasp, moth, etc. However, we hope to make use of it in the future as our gene_alignment_data "database" grows.
First, you should set up a directory in our recommended structure:
treeval-resources
│
├─ gene_alignment_data/
│  └─ { classT }
│     ├─ csv_data
│     │  └─ { Name.Accession }-data.csv    # Generated by our scripts
│     └─ { Name }                          # Here and below is generated by our scripts
│        └─ { Name.Accession }
│           ├─ cdna
│           │  └─ { Chunked fasta files }
│           ├─ rna
│           │  └─ { Chunked fasta files }
│           ├─ cds
│           │  └─ { Chunked fasta files }
│           └─ pep
│              └─ { Chunked fasta files }
│
├─ gene_alignment_prep/
│  ├─ scripts/                             # We supply these in this repo
│  ├─ raw_fasta/                           # Storing your fasta files downloaded from NCBI or Ensembl
│  └─ treeval-datasets.tsv                 # Organism, common_name, clade, family, group, link_to_data, notes
│
├─ synteny/
│  └─ { classT }
│
├─ treeval_yaml/                           # Storage folder for your yaml files, it's useful to keep them
│
└─ treeval_stats/                          # Storage for your treeval output stats files, e.g. for upload to our repo
classT can be your own system of classification, as long as it is consistent. At Sanger we use a fixed set of clade names (such as bird, which is used in the walkthrough below), and we advise you to do the same. Again, this value, which is entered into the yaml (the file we will use to tell TreeVal where everything is), is used to find gene_alignment_data as well as syntenic genomes.

Synteny
Details
For synteny, underneath the classT directory you should store the full genomic fasta file of any high quality genome you want your assembly to be compared against.
For bird, we recommend the Golden Eagle (Aquila chrysaetos) and the Zebrafinch (Taeniopygia guttata), which can be downloaded from NCBI. Rename these files to something more human readable and drop them into the synteny/bird/ folder. Any TreeVal run you now perform where the classT is bird will run a syntenic alignment against all genomes in that folder, so it is best to keep this to around three. Again, this is something we could expand on with the common_name field if people want it in the future; submit a feature request.
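For example, a sketch of adding one such genome (the URL is a placeholder for the NCBI assembly you actually choose, and the human-readable filename is just a suggestion):
cd synteny/bird/
# {NCBI genomic fna.gz URL} is a placeholder - substitute the assembly you want to compare against
curl {NCBI genomic fna.gz URL for your chosen genome} -o AquilaChrysaetos.fasta.gz
gunzip AquilaChrysaetos.fasta.gz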
Gene Alignment and Synteny Data and Directories
Details
Seeing as this can be quite complicated to set up, here's a walkthrough.
Step 1 - Set up the directories
Let's set up the system as if we want to run it on a bird genome.
mkdir -p gene_alignment_prep/scripts/
cp treeval/bin/treeval-dataprep/* gene_alignment_prep/scripts/
mkdir -p gene_alignment_prep/raw_fasta/
mkdir -p gene_alignment_data/bird/csv_data/
mkdir -p synteny/bird/
The naming of the bird folder here is important; keep this in mind.
So now we have this structure:
~/treeval-resources
│
├─ synteny/
│  └─ bird/
│
├─ gene_alignment_data/
│  └─ bird/
│     └─ csv_data/
│
└─ gene_alignment_prep/
   ├─ scripts/
   └─ raw_fasta/
Step 2 - Download some data
First, let's download our syntenic alignment data. The Zebrafinch (Taeniopygia guttata) would be a good comparison against the Chicken (Gallus gallus).
cd synteny/bird/
curl https://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/003/957/565/GCA_003957565.4_bTaeGut1.4.pri/GCA_003957565.4_bTaeGut1.4.pri_genomic.fna.gz -o bTaeGut1_4.fasta.gz
gunzip bTaeGut1_4.fasta.gz
This leaves us with a file called bTaeGut1_4.fasta, the genomic assembly of bTaeGut1_4 (the Tree of Life ID for this species), also known as Taeniopygia guttata, the Australian Zebrafinch.
Now let's move into the raw_fasta folder and download some data; this may take some time.
cd ../../gene_alignment_prep/raw_fasta/
curl https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/016/699/485/GCF_016699485.2_bGalGal1.mat.broiler.GRCg7b/GCF_016699485.2_bGalGal1.mat.broiler.GRCg7b_cds_from_genomic.fna.gz -o GallusGallus-GRCg7b.cds.fasta.gz
curl https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/016/699/485/GCF_016699485.2_bGalGal1.mat.broiler.GRCg7b/GCF_016699485.2_bGalGal1.mat.broiler.GRCg7b_genomic.fna.gz -o GallusGallus-GRCg7b.cdna.fasta.gz
curl https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/016/699/485/GCF_016699485.2_bGalGal1.mat.broiler.GRCg7b/GCF_016699485.2_bGalGal1.mat.broiler.GRCg7b_protein.faa.gz -o GallusGallus-GRCg7b.pep.fasta.gz
curl https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/016/699/485/GCF_016699485.2_bGalGal1.mat.broiler.GRCg7b/GCF_016699485.2_bGalGal1.mat.broiler.GRCg7b_rna.fna.gz -o GallusGallus-GRCg7b.rna.fasta.gz
Now that's all downloaded, we need to prep it. At this point it is all still gzipped (the .gz on the end denotes that the file is compressed), and in this format we can't use it. So let's use some bash magic.
The following is a for loop written in bash. It looks through the current folder for files ending with .fasta.gz, gunzips them (uncompressing each file so it is usable) and then runs our python script, GA_data_prep.py, on the result.
for i in *.fasta.gz; do
  gunzip $i;
  python3 ../scripts/GA_data_prep.py ${i/.gz} ncbi 10;
done
In plain English, this python script command means: take the uncompressed file that I downloaded from NCBI and cut it into pieces. A fasta file looks something like this, with headers (lines starting with `>`) and sequence (the usual ATGCs):
>SCAFFOLD_1_DATA_ON_METHODS
ATGCGCATGCATGCATCGACTTCGAGCATCGTAG
>SCAFFOLD_2_DATA_ON_METHODS
ACCAGTGCTAGCTAGCTACGTGTGGGTTTGCCCCGTTT
The headers will be trimmed to only the essential data you need in order to find the sequence in your database of choice.
A fasta file may be made up of anywhere between tens and many thousands of these header-sequence pairs. So, in the case of our cdna and pep files, they need to be cut up to give TreeVal a chance of reading them all in a small time frame. cds and rna files will be cut into 1,000 header-sequence pairs per file (the number given on the command line is ignored). pep and cdna files will be cut up by the number you give, or by 100; this is because pep and cdna files are so much larger.
The smaller the chunk number, the smaller the files you produce, but you will also produce many more of them, so there is a trade-off.
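To illustrate what this chunking means in practice, here is a rough sketch of the idea only; it is not the actual GA_data_prep.py script (which also renames the headers and builds the output folder structure). It splits a fasta into files of 1,000 records each using awk, with chunk_*.fa as assumed output names:
# Sketch: start a new output file every 1,000 fasta records (headers start with ">")
awk 'BEGIN { n = 0; f = "chunk_1.fa" }
     /^>/  { n++; if (n > 1 && (n - 1) % 1000 == 0) { close(f); f = "chunk_" n ".fa" } }
     { print > f }' GallusGallus-GRCg7b.cds.fasta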
Running the GA_data_prep.py command will produce a large amount of output in your terminal, looking like this:
python3 ../scripts/GA_data_prep.py GallusGallus-GRCg7b.cds.fasta ncbi 100
Your using at least Version 3.6, You are good to go...
os imported
argparse imported
regex imported
WORKING ON: cds--GallusGallus-GRCg7b
Records per file: 1000
Entryfunction called
GallusGallus-GRCg7b.cds.fasta
File found at %s GallusGallus-GRCg7b.cds.fasta
Renaming headers
Read_fasta called
File saved: -- ./GallusGallus/GallusGallus.GRCg7b/cds/GallusGallus1000cds.MOD.fa
File saved: -- ./GallusGallus/GallusGallus.GRCg7b/cds/GallusGallus2001cds.MOD.fa
File saved: -- ./GallusGallus/GallusGallus.GRCg7b/cds/GallusGallus3002cds.MOD.fa
File saved: -- ./GallusGallus/GallusGallus.GRCg7b/cds/GallusGallus4003cds.MOD.fa
File saved: -- ./GallusGallus/GallusGallus.GRCg7b/cds/GallusGallus5004cds.MOD.fa
File saved: -- ./GallusGallus/GallusGallus.GRCg7b/cds/GallusGallus6005cds.MOD.fa
File saved: -- ./GallusGallus/GallusGallus.GRCg7b/cds/GallusGallus7006cds.MOD.fa
File saved: -- ./GallusGallus/GallusGallus.GRCg7b/cds/GallusGallus8007cds.MOD.fa
File saved: -- ./GallusGallus/GallusGallus.GRCg7b/cds/GallusGallus9008cds.MOD.fa
This is pretty much telling us: yes, you have given me a file, and for every 1,000 header-sequence pairs I have come across (I'm ignoring the number you gave me because this isn't a pep or cdna file) I have made a new file, found here. You'll notice that it has also generated a new set of folders; this is based off of how we have named the file.
If you now type ls you should see the file we downloaded (GallusGallus-GRCg7b.cds.fasta) and the folder GallusGallus. We can now move this folder to its permanent home.
mv GallusGallus/ ../../gene_alignment_data/bird/
Step 3 -- Generate the CSV
This file acts as an index of all the files we have produced in the gene_alignment_data folder, and generating it is thankfully a very simple step.
cd ../
python3 scripts/GA_csv_gen.py /path/to/gene_alignment_data/
Running this will look like:
============> CorvusMon1.bCorMon1 -- bird
Generating CSV for: CorvusMon1.bCorMon1
Save Path: /gene_alignment_data/bird/csv_data/CorvusMon1.bCorMon1-data.csv
============> CorvusMoneduloides.bCorMon1 -- bird
Generating CSV for: CorvusMoneduloides.bCorMon1
Save Path: /gene_alignment_data/bird/csv_data/CorvusMoneduloides.bCorMon1-data.csv
============> Gallus_gallus.UW_022020 -- bird
Generating CSV for: Gallus_gallus.UW_022020
Save Path: /gene_alignment_data/bird/csv_data/Gallus_gallus.UW_022020-data.csv
============> Gallus_gallus.GRCg6a -- bird
Generating CSV for: Gallus_gallus.GRCg6a
Save Path: /gene_alignment_data/bird/csv_data/Gallus_gallus.GRCg6a-data.csv
============> GallusGallus.GRCg7b -- bird
Generating CSV for: GallusGallus.GRCg7b
Save Path: /gene_alignment_data/bird/csv_data/GallusGallus.GRCg7b-data.csv
What is happening here is that the script moves through the directory, identifies each unique assembly folder and generates a CSV summarising the data found in it. Each CSV contains the following information and looks like this:
head -n 5 /gene_alignment_data/bird/csv_data/Gallus_gallus.GRCg6a-data.csv
org,type,data_file
Gallus_gallus.GRCg6a,cds,/gene_alignment_data/bird/Gallus_gallus/Gallus_gallus.GRCg6a/cds/Gallus_gallus9002cds.MOD.fa
Gallus_gallus.GRCg6a,cds,/gene_alignment_data/bird/Gallus_gallus/Gallus_gallus.GRCg6a/cds/Gallus_gallus28453cds.MOD.fa
Gallus_gallus.GRCg6a,cds,/gene_alignment_data/bird/Gallus_gallus/Gallus_gallus.GRCg6a/cds/Gallus_gallus18005cds.MOD.fa
Gallus_gallus.GRCg6a,cds,/gene_alignment_data/bird/Gallus_gallus/Gallus_gallus.GRCg6a/cds/Gallus_gallus6001cds.MOD.fa
This is all used by the pipeline, which generates job IDs based on the org column, groups files by the org and type columns, and then pulls the data from the path in the data_file column.
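As an optional sanity check (not part of the pipeline itself), you can count how many chunked files were indexed per data type; column 2 of the CSV is the type:
# Skip the header row, take the type column and tally it
tail -n +2 /gene_alignment_data/bird/csv_data/Gallus_gallus.GRCg6a-data.csv | cut -d',' -f2 | sort | uniq -c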
Step 4 -- Understand where we are at
We have now generated the directory structure for gene_alignment_data, so let's use what we know to fill out the yaml.
The yaml is a file that we need in order to tell the pipeline where everything is; an example can be found here.
Here we can see a number of fields that need to be filled out, the easiest being synteny_genome_path and data_dir. These refer to the directories we made earlier, so we can fill them in as such:
alignment:
  data_dir: /FULL/PATH/TO/treeval-resources/gene_alignment_data/

synteny:
  synteny_genome_path: /FULL/PATH/TO/treeval-resources/synteny
I said earlier that the fact we called a folder bird was important; this is because it now becomes our classT:
classT: bird
During the running of the pipeline, this value is appended onto the end of data_dir and synteny_genome_path in order to find the correct files to use. So now all of the files inside /FULL/PATH/TO/treeval-resources/synteny/bird/ will be used for syntenic alignments. Likewise with our alignment.data_dir: TreeVal will turn it into /FULL/PATH/TO/treeval-resources/gene_alignment_data/bird/ and then append csv_data.
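If you want to check that TreeVal will find your data at those resolved locations, a quick optional look is enough:
# These should list your syntenic genomes and your generated geneset CSVs respectively
ls /FULL/PATH/TO/treeval-resources/synteny/bird/
ls /FULL/PATH/TO/treeval-resources/gene_alignment_data/bird/csv_data/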
In Step 3, we generated some files which will be living in our /FULL/PATH/TO/treeval-resources/gene_alignment_data/bird/csv_data/ folder and look like GallusGallus.GRCg7b-data.csv. These (minus the -data.csv) will be what we enter into the geneset field in the yaml. The common_name is a field we don't currently use.
alignment:
  data_dir: /FULL/PATH/TO/treeval-resources/gene_alignment_data/
  common_name: "" # For future implementation (adding bee, wasp, ant etc)
  geneset: "GallusGallus.GRCg7b"
However, what is cool about this field is that you can add as many genesets as you want. So say you also have gene alignment data for the Zebrafinch saved in alignment.data_dir as TaeniopygiaGuttata.bTaeGut1_4; the geneset field then becomes: geneset: "GallusGallus.GRCg7b,TaeniopygiaGuttata.bTaeGut1_4"
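Put together, the alignment block would then look like the following (a sketch only; the Zebrafinch geneset is assumed to have been prepared the same way as the Chicken one above):
alignment:
  data_dir: /FULL/PATH/TO/treeval-resources/gene_alignment_data/
  common_name: "" # For future implementation (adding bee, wasp, ant etc)
  geneset: "GallusGallus.GRCg7b,TaeniopygiaGuttata.bTaeGut1_4"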
Hopefully this explains things a bit better and you understand how this sticks together!
HiC data Preparation
Details
Illumina HiC read files should be presented in an unmapped CRAM format; each must be accompanied by an index file (.crai) generated by samtools index. If your unmapped HiC reads are in FASTQ format, you should first convert them to CRAM format using samtools import. Examples are below:
Conversion of FASTQ to CRAM
samtools import -@8 -r ID:{prefix} -r CN:{hic-kit} -r PU:{prefix} -r SM:{sample_name} {prefix}_R1.fastq.gz {prefix}_R2.fastq.gz -o {prefix}.cram
Indexing of CRAM
samtools index {prefix}.cram
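If you have several read pairs to convert, a sketch of looping over them (the read-group values {hic-kit} and {sample_name} are placeholders you should replace with your own metadata, exactly as in the single-file command above):
for R1 in *_R1.fastq.gz; do
  prefix=${R1%_R1.fastq.gz}
  # Placeholder read-group values - substitute your own kit and sample names
  samtools import -@8 -r ID:${prefix} -r CN:{hic-kit} -r PU:${prefix} -r SM:{sample_name} \
    ${prefix}_R1.fastq.gz ${prefix}_R2.fastq.gz -o ${prefix}.cram
  samtools index ${prefix}.cram
done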
PacBio Data Preparation
Details
Before running the pipeline, the PacBio data has to be in fasta.gz format. Because of the software we use this data with, it must also be long-read, single-stranded data. This means you could use ONT data too (except duplex reads).
The below commands should help you convert from mapped bam to fasta.gz, or from fastq to fasta.
If your data isn't already in these formats, then let us know and we'll see how we can help.
BAM -> FASTQ
This command iterates through your bam files and converts them to fastq via samtools.
cd { TO FOLDER OF BAM FILES }
mkdir fastq
for i in *bam
do
  echo $i
  j=${i%.bam}
  echo $j
  samtools bam2fq ${i} > fastq/${j}.fq
done
FASTQ -> FASTA
This command creates a fasta folder (to store our fasta files), moves into the fastq folder and then converts fastq to fasta using seqtk seq.
mkdir fasta
cd fastq
for i in *fq; do
  echo $i
  j=${i%.fq}
  echo $j
  seqtk seq -a $i > ../fasta/${j}.fasta
done
FASTA -> FASTA.GZ
This simply gzips the fasta files.
for i in *.fasta; do
  echo $i
  gzip $i
done
Or, if you're a command line ninja:
samtools bam2fq {prefix}.bam | seqtk seq -a - | gzip - > {prefix}.fasta.gz
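As a sketch combining the loop and the one-liner above, converting every BAM in the current folder straight to fasta.gz:
for i in *.bam; do
  samtools bam2fq ${i} | seqtk seq -a - | gzip - > ${i%.bam}.fasta.gz
done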
Pretext Accessory File Ingestion
Details
Note: This will require you to install bigWigToBedGraph from the UCSC tools package. Instructions on downloading this can be found at EXAMPLE #3.
The accessory files generated by the pipeline are not automatically ingested into the Pretext files. For this you must use the following code:
cd {outdir}/hic_files
bigWigToBedGraph {coverage.bigWig} /dev/stdout | PretextGraph -i { your.pretext } -n "coverage"
bigWigToBedGraph {repeat_density.bigWig} /dev/stdout | PretextGraph -i { your.pretext } -n "repeat_density"
cat {telomere.bedgraph} | awk -v OFS="\t" '{$4 = 1000; print}'|PretextGraph -i { your.pretext } -n "telomere"
cat {gap.bedgraph} | awk -v OFS="\t" '{$4= 1000; print}'| PretextGraph -i { your.pretext } -n "gap"
Full samplesheet
YAML is "Yet Another Markdown Language", it is a human-readable format that we use to tell TreeVal a number of things. This includes; assembly location, telomere motif, pacbio data files (in fasta.gz format) and HiC cram files. The full Yaml is detailed below.
YAML contents
The following is an example YAML file we have used during production, nxOscDF5033.yaml, and its fields are described below. It contains some annotations we believe to be helpful; information on the alignment, synteny, pacbio and hic fields is explained in the PacBio, Gene Alignment and Synteny sections above.
- assembly
  - sample_id: ToLID of the sample.
  - latin_name: Latin identification of the species.
  - classT: Clade name (as used to group synteny sequences and to complete alignment/data_dir).
  - TicketType:
- reference_file: Sample .fa file.
- assem_reads
  - pacbio: path to folder containing fasta.gz files.
  - hic: path to folder containing cram files.
  - supplementary: # Will be required in future development.
- alignment
  - data_dir: Gene alignment data path.
  - common_name: # For future implementation (adding bee, wasp, ant etc)
  - geneset: a csv list of geneset data to be used.
- self_comp
  - motif_len: Length of motif to be used in self-complementary sequence finding.
  - mummer_chunk: Size of chunks used by the MUMMER module.
- synteny
  - synteny_genome_path: Path to syntenic genomes grouped by clade.
- outdir: Will be required in future development.
- intron
  - size: base pair size of introns, default is 50k.
- telomere
  - teloseq: Telomeric motif.
- busco
  - lineages_path: path to the folder above the lineages folder.
  - lineage: Example is nematode_odb10.
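Putting those fields together, a minimal illustrative yaml might look like the following. This is a sketch only: the paths are placeholders, and values such as sample_id, TicketType, motif_len, mummer_chunk, teloseq and lineage are assumed examples based on the bird walkthrough above, not the production nxOscDF5033.yaml.
assembly:
  sample_id: bMySample1            # assumed example ToLID
  latin_name: Gallus_gallus
  classT: bird
  TicketType: example              # placeholder
reference_file: /FULL/PATH/TO/assembly/bMySample1.fa
assem_reads:
  pacbio: /FULL/PATH/TO/pacbio/fasta/
  hic: /FULL/PATH/TO/hic/cram/
  supplementary: ""                # Will be required in future development
alignment:
  data_dir: /FULL/PATH/TO/treeval-resources/gene_alignment_data/
  common_name: ""                  # For future implementation
  geneset: "GallusGallus.GRCg7b"
self_comp:
  motif_len: 0                     # placeholder motif length
  mummer_chunk: 10                 # placeholder chunk size for the MUMMER module
synteny:
  synteny_genome_path: /FULL/PATH/TO/treeval-resources/synteny
outdir: ""                         # Will be required in future development
intron:
  size: "50k"
telomere:
  teloseq: TTAGGG                  # vertebrate telomeric repeat, assumed example
busco:
  lineages_path: /FULL/PATH/TO/databases/busco_2023_03/v5/data   # folder above the lineages folder
  lineage: aves_odb10              # assumed example; the doc's example is nematode_odb10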
Notes on using BUSCO
The pipeline requires the use of the BUSCO subworkflows. Create the database directory and move into the directory:
DATE=2023_03
BUSCO=/path/to/databases/busco_${DATE}
mkdir -p $BUSCO
cd $BUSCO
Download BUSCO data and lineages to allow BUSCO to run in offline mode.
wget -r -nH https://busco-data.ezlab.org/v5/data/
The trailing slash after data is important; otherwise wget doesn't download the subdirectories that make up the database.
Decompress all the folders that have been stored as tar.gz, in the same parent directories where they were downloaded:
find v5/data -name "*.tar.gz" | while read -r TAR; do tar -C `dirname $TAR` -xzf $TAR; done
If you have GNU parallel installed, you can also use the command below which will run faster as it will run the decompression commands in parallel:
find v5/data -name "*.tar.gz" | parallel "cd {//}; tar -xzf {/}"
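Assuming the wget command above produces a v5/data directory containing a lineages folder (check your own download), the busco fields in the yaml would then point at the folder above lineages, for example:
busco:
  lineages_path: /path/to/databases/busco_2023_03/v5/data
  lineage: nematode_odb10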
Sub-workflows
- YAML_INPUT - Reads the input yaml and generates parameters used by other workflows.
- GENERATE_GENOME - Builds the genome description file of the reference genome.
- LONGREAD_COVERAGE - Produces read coverage based on the PacBio long read fasta file.
- GAP_FINDER - Identifies contig gaps in the input genome.
- REPEAT_DENSITY - Reports the intensity of regional repeats within an input assembly.
- HIC_MAPPING - Aligns Illumina HiC short reads to the input genome and generates mapping files in three formats for visualisation: .pretext, .hic and .mcool.
- TELO_FINDER - Finds a user-given motif in the input genome.
- GENE_ALIGNMENT - Aligns the peptide and nuclear data from assemblies of related species to the input genome.
- INSILICO_DIGEST - Generates a map of enzymatic digests using 3 Bionano enzymes.
- SELFCOMP - Identifies regions of self-complementary sequence.
- SYNTENY - Generates syntenic alignments between other high quality genomes.
- BUSCO_ANALYSIS - Uses BUSCO to identify ancestral elements. Also used to identify ancestral Lepidopteran genes (Merian units).
Running the pipeline
The typical command for running the pipeline is as follows:
nextflow run sanger-tol/treeval --input assets/treeval.yaml --outdir <OUTDIR> -profile singularity,sanger
with the treeval.yaml containing the information from the YAML Contents section above.
Updating the pipeline
When you run the above command, Nextflow automatically pulls the pipeline code from GitHub and stores it as a cached version. When running the pipeline after this, it will always use the cached version if available - even if the pipeline has been updated since. To make sure that you're running the latest version of the pipeline, make sure that you regularly update the cached version of the pipeline:
nextflow pull sanger-tol/treeval
Reproducibility
It is a good idea to specify a pipeline version when running the pipeline on your data. This ensures that a specific version of the pipeline code and software are used when you run your pipeline. If you keep using the same tag, you'll be running the same version of the pipeline, even if there have been changes to the code since.
First, go to the sanger-tol/treeval releases page and find the latest version number - numeric only (eg. 1.3.1). Then specify this when running the pipeline with -r (one hyphen) - eg. -r 1.3.1.
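For example, assuming 1.3.1 is the release you picked, the run command from above becomes:
nextflow run sanger-tol/treeval -r 1.3.1 --input assets/treeval.yaml --outdir <OUTDIR> -profile singularity,sanger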
This version number will be logged in reports when you run the pipeline, so that you'll know what you used when you look back in the future.
Core Nextflow arguments
NB: These options are part of Nextflow and use a single hyphen (pipeline parameters use a double-hyphen).
-profile
Use this parameter to choose a configuration profile. Profiles can give configuration presets for different compute environments.
Several generic profiles are bundled with the pipeline which instruct the pipeline to use software packaged using different methods (Docker, Singularity, Podman, Shifter, Charliecloud, Conda) - see below. When using Biocontainers, most of these software packaging methods pull Docker containers from quay.io (e.g. FastQC), except for Singularity, which directly downloads Singularity images via https hosted by the Galaxy project, and Conda, which downloads and installs software locally from Bioconda.
We highly recommend the use of Docker or Singularity containers for full pipeline reproducibility, however when this is not possible, Conda is also supported.
The pipeline also dynamically loads configurations from https://github.com/nf-core/configs when it runs, making multiple config profiles for various institutional clusters available at run time. For more information and to see if your system is available in these configs please see the nf-core/configs documentation.
Note that multiple profiles can be loaded, for example: -profile test,docker - the order of arguments is important! They are loaded in sequence, so later profiles can overwrite earlier profiles.
If -profile is not specified, the pipeline will run locally and expect all software to be installed and available on the PATH. This is not recommended.
- docker - A generic configuration profile to be used with Docker
- singularity - A generic configuration profile to be used with Singularity
- podman - A generic configuration profile to be used with Podman
- shifter - A generic configuration profile to be used with Shifter
- charliecloud - A generic configuration profile to be used with Charliecloud
- conda - A generic configuration profile to be used with Conda. Please only use Conda as a last resort, i.e. when it's not possible to run the pipeline with Docker, Singularity, Podman, Shifter or Charliecloud.
- test - A profile with a complete configuration for automated testing. Includes links to test data so needs no other parameters.
-resume
Specify this when restarting a pipeline. Nextflow will use cached results from any pipeline steps where the inputs are the same, continuing from where it got to previously. For input to be considered the same, not only the names must be identical but the files' contents as well. For more info about this parameter, see this blog post.
You can also supply a run name to resume a specific run: -resume [run-name]. Use the nextflow log command to show previous run names.
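For instance, re-launching the earlier run command unchanged but with -resume (a sketch; the yaml and outdir are whatever you used before):
nextflow run sanger-tol/treeval --input assets/treeval.yaml --outdir <OUTDIR> -profile singularity,sanger -resume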
-c
Specify the path to a specific config file (this is a core Nextflow command). See the nf-core website documentation for more information.
Custom configuration
Resource requests
Whilst the default requirements set within the pipeline will hopefully work for most people and with most input data, you may find that you want to customise the compute resources that the pipeline requests. Each step in the pipeline has a default set of requirements for number of CPUs, memory and time. For most of the steps in the pipeline, if the job exits with any of the error codes specified here it will automatically be resubmitted with higher requests (2 x original, then 3 x original). If it still fails after the third attempt then the pipeline execution is stopped.
To change the resource requests, please see the max resources and tuning workflow resources section of the nf-core website.
nf-core/configs
In most cases, you will only need to create a custom config as a one-off, but if you and others within your organisation are likely to be running nf-core pipelines regularly and need to use the same settings, it may be a good idea to request that your custom config file is uploaded to the nf-core/configs git repository. Before you do this, please test that the config file works with your pipeline of choice using the -c parameter. You can then create a pull request to the nf-core/configs repository with the addition of your config file, the associated documentation file (see examples in nf-core/configs/docs), and an amendment to nfcore_custom.config to include your custom profile.
See the main Nextflow documentation for more information about creating your own configuration files.
If you have any questions or issues please send us a message on Slack on the #configs channel.
Running in the background
Nextflow handles job submissions and supervises the running jobs. The Nextflow process must run until the pipeline is finished.
The Nextflow -bg flag launches Nextflow in the background, detached from your terminal so that the workflow does not stop if you log out of your session. The logs are saved to a file.
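For example (a sketch; it assumes the same run command as before and sends the output to a log file of your choice):
nextflow run sanger-tol/treeval --input assets/treeval.yaml --outdir <OUTDIR> -profile singularity,sanger -bg > treeval_run.log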
Alternatively, you can use screen / tmux or a similar tool to create a detached session which you can log back into at a later time.
Some HPC setups also allow you to run nextflow within a cluster job submitted to your job scheduler (from where it submits more jobs).
Nextflow memory requirements
In some cases, the Nextflow Java virtual machines can start to request a large amount of memory.
We recommend adding the following line to your environment to limit this (typically in ~/.bashrc or ~/.bash_profile):
NXF_OPTS='-Xms1g -Xmx4g'