Pipeline Review Guidelines

Suggestions for reviewing pipeline pull requests

The aim is to have standardised best-practice pipelines. To ensure this standardisation, we maintain a set of guidelines which all sanger-tol pipelines must adhere to. These are adapted from nf-core guidelines and review checklist.

Pipeline developers are encouraged to create small, modular pull requests (PRs) to get the most out of the review process. Think about this before writing the code and opening the pull request, as breaking a PR down into several afterwards can be tricky. As a rule of thumb, a PR should not add more than one sub-workflow, and a sub-workflow should not contain more than ten steps. A PR can modify multiple sub-workflows, as long as the changes are related.

The role of the reviewer is to check for adherence to the central principles of nf-core and sanger-tol (reproducibility, excellent reporting, documentation, keeping to the template, etc.). Here we provide a general set of suggestions for pipeline reviews.

The instructions below are subject to interpretation and specific scenarios. If in doubt, please ask for feedback.

Do: nf-core principles

All sanger-tol pipelines must follow the following guidelines:

Identity and branding: Primary development must happen on the sanger-tol organisation.
Workflow size: Not too big, not too small.
Workflow name: Names should be lower case and without punctuation.
Use the template: All sanger-tol pipelines must be built using the nf-core template and sanger-tol branding.
Software licence: Pipelines must be open source, released with the MIT licence.
Bundled documentation: Pipeline documentation must be stored in the repository and viewable on the pipeline website.
Docker support: Software must be bundled using Docker and versioned.
Continuous integration testing: Pipelines must pass CI tests.
Semantic versioning: Pipelines must use stable release tags.
Standardised parameters: Strive to have standardised usage.
Single command: Pipelines should run in a single command.
Keywords: Excellent documentation and GitHub repository keywords.
Pass lint tests: The pipeline must not have any failures in the nf-core lint tests.
Credits and Acknowledgements: Pipelines must properly acknowledge prior work.
Minimum inputs: Pipelines should be able to run with as little input as possible.
Use sanger-tol git branches: Use main, dev and TEMPLATE.

Do: nf-core recommendations

All sanger-tol pipelines should follow the following guidelines, if possible / appropriate:

Use Bioconda: Package software using bioconda and biocontainers.
File formats: Use community accepted modern file formats such as CRAM.
DOIs: Pipelines should have digital object identifiers (DOIs).
Publication credit: Pipeline publications should acknowledge the sanger-tol community and contributing members.

Do: Local code and modules

Do local scripts in bin/ have author and licence embedded?
- We don't need to repeat the licence in our scripts, since the MIT licence is defined in the LICENSE file at the root.
- Check the origin and licensing of any third-party script included in bin/. Pay particular attention to scripts that are licensed under something different from MIT. If in doubt, ask @muffato.
Do all local modules have docker/singularity/conda declarations?
- Are they ideally in bioconda/biocontainers ?
Do all local modules conda/container tool declarations have versions? (and not latest, dev etc.)
Do all local modules report versions (if applicable)?
- Not necessary for simple modules with e.g. a single grep operation
- It would be good to report versions for more complex operations, such as awk
Should any local modules be in nf-core/modules?

Do: Documentation

Documentation is only on the pipelines website (no pointers to other places, e.g. readthedocs)
Is documentation sufficiently described (usage.md, output.md, nextflow_schema.json)?
- nextflow_schema.json: check if types are correct and that default and enum are used where applicable
Are there any typos in the documentation (usage.md, output.md, nextflow_schema.json)?
Is CHANGELOG sufficiently filled in?
- Check version system is three-point SemVer e.g. 2.1.0
- Has the date been updated?
Check citation formatting consistency in CITATIONS.md
Check that all tools are cited
Check that (all) pipeline author(s) listed themselves in the manifest and other contributors are added in the README
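Part of the nextflow_schema.json check above can be automated. The following is a minimal sketch, not an official tool: it assumes that parameters are grouped under "$defs" (or the older "definitions") with a "properties" mapping in each group, and the parameter names in the example (input, aligner, outdir) are hypothetical.

```python
def iter_params(schema):
    """Yield (name, spec) for every parameter found in a schema dictionary."""
    # Assumption: parameter groups live under "$defs" or "definitions";
    # parameters declared directly under top-level "properties" are read too.
    groups = list(schema.get("$defs", {}).values())
    groups += list(schema.get("definitions", {}).values())
    groups.append(schema)
    for group in groups:
        for name, spec in group.get("properties", {}).items():
            yield name, spec


def check_schema(schema):
    """Return a list of human-readable problems found in the schema."""
    problems = []
    for name, spec in iter_params(schema):
        if "type" not in spec:
            problems.append(f"{name}: no type declared")
        if "enum" in spec and "default" in spec and spec["default"] not in spec["enum"]:
            problems.append(
                f"{name}: default {spec['default']!r} is not one of the enum values"
            )
    return problems


# Hypothetical schema fragment, for illustration only.
example = {
    "$defs": {
        "input_output_options": {
            "properties": {
                "input": {"type": "string", "description": "Path to the samplesheet"},
                "aligner": {
                    "type": "string",
                    "enum": ["bwa", "minimap2"],
                    "default": "bowtie",
                    "description": "Aligner to use",
                },
                "outdir": {"description": "Output directory"},
            }
        }
    }
}

for problem in check_schema(example):
    print(problem)
```

For a real pipeline, the dictionary would come from json.load() on nextflow_schema.json. A script like this only flags candidates; the reviewer still has to judge whether each type, default and enum is appropriate.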

Do: Code

Check that there are no overly non-template components (no readthedocs, entirely custom logo etc.)
Check for general code readability
Check for possible code bugs
Check for consistency in parameters to simplify the user experience
- i.e. snake_case
- All boolean parameters evaluate to false by default or all boolean parameters evaluate to true by default.
- All boolean parameters are named --enable-* or all boolean parameters are named --skip-*, etc.
Check manifest includes DOI (if present) etc.
Check that the only files executable are in the bin/ directory.
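The parameter consistency checks in the list above can be sketched in a few lines. This is an illustrative helper rather than an existing tool: the regular expression is one possible reading of "snake_case", the function names are made up, and the sample parameter names are hypothetical.

```python
import re

# Assumption: snake_case means lower-case words and digits joined by single underscores.
SNAKE_CASE = re.compile(r"[a-z][a-z0-9]*(_[a-z0-9]+)*")


def not_snake_case(names):
    """Return the parameter names that are not snake_case, in input order."""
    return [name for name in names if SNAKE_CASE.fullmatch(name) is None]


def boolean_prefixes(boolean_names, prefixes=("enable", "skip")):
    """Return the set of known prefixes used by the given boolean parameters.

    More than one element in the result means the naming is mixed.
    """
    used = set()
    for name in boolean_names:
        # Accept either "_" or "-" as the separator after the prefix.
        head = re.split(r"[_-]", name, maxsplit=1)[0]
        if head in prefixes:
            used.add(head)
    return used


# Hypothetical parameter names, for illustration only.
print(not_snake_case(["input", "min_length", "outDir", "skip-qc"]))
print(boolean_prefixes(["skip_qc", "enable_blast"]))
```

A result with both "enable" and "skip" indicates that the pipeline mixes the two conventions, which is what the reviewer should raise with the authors.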

Don't have to do

Review module code from nf-core/modules
Comment on scientific content (unless you are familiar with the topic)
Major code optimisation
- You can suggest small code optimisations
- Larger ones you can recommend, but should not necessarily be required for release

If the guidelines don't fit

We appreciate that the above guidelines are relatively rigid and may not always fit. If that's the case, please discuss at one of the pipelines meetings.

We hope that the nf-core best practices, tooling and community are helpful for anyone building Nextflow pipelines, even if they are not a good fit for being listed as official nf-core pipelines.

If a pipeline is found to be violating the standards and guidelines, you should try to address the problems with the pipeline maintainers through discussion. Hopefully the pipeline can then be updated so that it adheres to the guidelines.

All members of the sanger-tol community must adhere to the sanger-tol code of conduct. The guidelines and actions within the code of conduct take precedence over the development guidelines described in this page.

Guidelines

Identity and branding

Link with nf-core

Please don't call your pipeline nf-core/<yourpipeline>, it must be sanger-tol/<yourpipeline>. Please say that your pipeline "uses" nf-core rather than "is" nf-core. When you generate a pipeline with nf-core create, exclude nf-core branding and select the custom prefix sanger-tol.

If reviewing a pipeline on an older nf-core version, double-check the occurrences of the keyword nf-core in the repository. Everything that is meant to be about our org must rather be sanger-tol.

Also check the occurrences of nf-co.re. We have our own website https://pipelines.tol.sanger.ac.uk to display all the pipeline documentation.

Development must happen on the sanger-tol organisation.

All ToL developers have got write access to the sanger-tol repositories so that all development can happen directly there.

Do not fork sanger-tol repositories.

When new pipelines are added to sanger-tol, please transfer ownership to sanger-tol instead of forking it.

If you have already forked your pipeline to sanger-tol, you can email GitHub support and request that they reroute the fork. Alternatively, contact the IT team and we may be able to help.

Disable GitHub features for forks

To encourage contributors to focus on the sanger-tol repository, please disable GitHub issues / wiki / projects on your forked repository. You'll find these options under the GitHub repository settings.

Workflow size

We aim to have a "not too big, not too small" rule. This is deliberately fuzzy, but as a rule of thumb workflows should contain at least three processes and be simple enough to run that a new user can realistically run the pipeline after spending ten minutes reading the docs.

The size of a pipeline mostly depends on the scope of the work, but please consider disk, RAM and other resources. The larger the resource usage, the fewer jobs can run in parallel.

Workflow name

All sanger-tol pipeline names should be lower case and without punctuation. This is to maximise compatibility with other platforms such as Docker Hub, which enforce such rules. We prefer names that describe the data or analysis type the pipeline will be using or performing, and names should be approved. In documentation, please refer to your pipeline as sanger-tol/pipeline.
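As a rough illustration of the naming rule, the check below accepts only lower-case ASCII letters and digits. Allowing digits is an assumption, since the rule above only says "lower case and without punctuation", and the names tested are hypothetical.

```python
import re


def is_valid_pipeline_name(name):
    """Return True if the name is lower case with no punctuation or spaces."""
    # Assumption: digits are acceptable alongside lower-case letters.
    return re.fullmatch(r"[a-z0-9]+", name) is not None


# Hypothetical names, for illustration only.
for candidate in ("examplepipeline", "Example-Pipeline", "example_pipeline"):
    print(candidate, is_valid_pipeline_name(candidate))
```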

Use the template

All sanger-tol pipelines must be built using the nf-core template with a custom prefix sanger-tol.

Workflows should be started using the nf-core create command which makes a new git repository and the initial commits and branches. This is to ensure that the sync process can work. See the sync docs for details.

Where possible, workflow authors should do their best to follow nf-core conventions for filenames and code locations.

Software licence

All sanger-tol pipelines must be released with an MIT licence. The copyrights belong to Genome Research Ltd. as per Wellcome Sanger Institute policy.

Please try not to bundle any third-party scripts within the workflow (for example, in the bin directory), in case they have a different or incompatible licence. If you need such a script, even a simple one, try to release it on bioconda or as a container instead and reference it like any other software. If you still decide to bundle the third-party script (software) with the workflow, make sure the licence file is updated accordingly.

Bundled documentation

All documentation must be bundled with the pipeline code in the main repository, within a directory called docs.

Documentation must only be hosted on the GitHub repository, which is automatically synchronised to the pipelines website. Hosting the documentation at a second location (such as custom readthedocs website, or GitHub pages etc) is not allowed. This is to ensure that users of sanger-tol pipelines can always intuitively find the documentation for all sanger-tol pipelines in the same way.

Documentation must include at least the following files:

README.md
docs/usage.md
docs/output.md

Docker support

Pipelines must have all software bundled using Docker - that is, it must be possible to run the pipeline with -profile docker and have all software requirements satisfied.

Tools should use docker images from biocontainers where possible, as using Bioconda / Biocontainers gives support for conda + docker + singularity.

All containers must have specific, stable versions pinned. These should preferably be named after a software release, but it can also be by commit or some other identifier. Software versions must be static and stable. Labels such as latest, dev, master and so on are not reproducible over time and so not allowed.
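The pinning rule can be checked mechanically on image references. The sketch below is a simplified illustration: the set of rejected labels only contains the three named above and would need extending, treating a sha256 digest as a valid pin is an assumption, and the image references are examples rather than requirements.

```python
# Labels named in the guideline; "and so on" means this set is not exhaustive.
FLOATING_TAGS = {"latest", "dev", "master"}


def container_tag(image):
    """Return the tag of an image reference, or None if it has no tag."""
    name = image.split("@", 1)[0]          # drop any digest part
    last_segment = name.rsplit("/", 1)[-1]  # ignore registry host and port
    if ":" not in last_segment:
        return None
    return last_segment.rsplit(":", 1)[1]


def is_pinned(image):
    """Return True if the image reference points at a fixed version."""
    # Assumption: a sha256 digest counts as a stable identifier.
    if "@sha256:" in image:
        return True
    tag = container_tag(image)
    return tag is not None and tag not in FLOATING_TAGS


# Illustrative image references.
for image in (
    "quay.io/biocontainers/samtools:1.17--h00cdaf9_0",
    "ubuntu:latest",
    "ubuntu",
):
    print(image, is_pinned(image))
```

An untagged reference is reported as not pinned because Docker resolves it to latest.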

Continuous integration testing

Pipelines must have automated continuous integration testing, running using GitHub Actions. There must be a small dataset that can be tested on GitHub directly, and a larger one that can be tested on the Sanger farm using Seqera Platform.

There must be a config profile called test that should be as comprehensive as possible - that is, it should run as much of the pipeline as possible. It should use as tiny a test data set as possible (even if the output that it creates is meaningless).

Then, we configure the integration with Seqera Platform to allow testing the larger dataset (test_full) on the Sanger LSF farm. To set that up, first add the profile cleanup { cleanup = true } to your nextflow.config (right at the beginning of the profiles section). This is to control the amount of space taken on Lustre. Then, copy the two files sanger_test.yml and sanger_test_full.yml to your .github/workflows/. Ask @muffato to enable the Seqera Platform integration for your repository.

Semantic versioning

Pipelines must be released with stable release tags. Releases must use GitHub releases and keep a detailed changelog file.

Release version tags must be numerical only (no v prefix) and should follow semantic versioning rules: [major].[minor].[patch]

For example, starting with a release version 1.4.3, bumping the version to:

1.4.4 would be a patch release for minor things such as fixing bugs.
1.5.0 would be a minor release, for example adding some new features, but still being backwards compatible.
2.0.0 would correspond to the major release where results would no longer be backwards compatible.
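The examples above can be expressed as a small helper that validates a release tag and names the kind of bump between two versions. This is a sketch under the rules stated in this section (numerical only, no v prefix, three components); the function names are made up for illustration.

```python
import re

# Three numeric components separated by dots, with no "v" prefix.
SEMVER = re.compile(r"(\d+)\.(\d+)\.(\d+)")


def parse_version(tag):
    """Return (major, minor, patch) as integers, or raise ValueError."""
    match = SEMVER.fullmatch(tag)
    if match is None:
        raise ValueError(f"not a [major].[minor].[patch] tag: {tag!r}")
    return tuple(int(part) for part in match.groups())


def bump_kind(old, new):
    """Return "major", "minor" or "patch" for a bump from old to new."""
    old_version, new_version = parse_version(old), parse_version(new)
    if new_version <= old_version:
        raise ValueError(f"{new} is not newer than {old}")
    if new_version[0] != old_version[0]:
        return "major"
    if new_version[1] != old_version[1]:
        return "minor"
    return "patch"


for candidate in ("1.4.4", "1.5.0", "2.0.0"):
    print("1.4.3 ->", candidate, "is a", bump_kind("1.4.3", candidate), "release")
```

The helper only compares components. It does not check that the lower components were reset to zero, nor whether the changes themselves justify the chosen bump, which remains a judgement for the reviewer.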

Standardised parameters

Where possible pipelines should use the same command line option names as other pipelines for comparable options. For example, --input and --fasta.

In addition to the names of parameters, they should ideally work in a similar way. For example, --input typically takes a .csv sample sheet file (but not always, where not appropriate).

nf-core are planning to build a tool that lists every parameter used by every pipeline, so that you can check for existing parameters with similar names. You can track the progress of this feature request here: nf-core/website#1251. Once complete, we will try to implement this for sanger-tol pipelines as well.

Single command

Every sanger-tol pipeline repository must contain a single pipeline. That is, there should be a main.nf file that is the single way to launch a pipeline.

It is ok to have multiple 'tracks' within the pipeline, selectable with configuration options.
It is ok to have workflows that use the output of another nf-core pipeline as input.

It should be possible to run all parts of the workflow using nextflow run sanger-tol/<pipeline>, without any specific .nf filename.

Keywords

Pipelines should have excellent documentation.

Repositories should have GitHub Topics set on the sanger-tol repository. These are then shown on the pipelines website and used for categorisation and searching. They are important for workflow visibility and findability.

Topics can be related to the workflow, tools, data or any related aspect.

Pass lint tests

In order to automate and standardise the nf-core best practices, there is a code linting tool. These tests are run by the nf-core/tools package. The nf-core lint command must be run by continuous integration tests on GitHub Actions and must pass for each pull request and before release.

You can see the list of tests and how to pass them on the error codes page.

In some exceptional circumstances, it is ok to ignore certain tests using .nf-core.yml. If that's the case, please discuss first at one of the pipelines meetings.

Credits and Acknowledgements

Please acknowledge all developers, reviewers and anyone who has contributed to the finished product (even through verbal discussions and suggestions). Everyone should be credited in alphabetical order by last name. This is independent of authorship in any associated manuscript.

Where previous work from other pipelines / projects is used within a pipeline, the original author(s) must be properly acknowledged. Some examples on how you could do that to make sure they feel valued:

Send them a message via Slack and let them know that you use their work and had to change something to fit your own purpose. If in doubt, check with them to see how they would like to be acknowledged.
Check the licence of their code and/or graphics components and make sure you obey the rules that this licence imposes (e.g. CC-BY means you have to attribute the original creator).
If you use portions of pipeline code, even if it's just tiny pieces:
- Link to the original repository and/or authors.
- Leave existing credits and acknowledgement sections intact - there may be more than just a single author involved.
If you find bugs / issues, report and fix them upstream in the main project.

If in doubt about what to do, ask on Slack or discuss at the fortnightly pipeline meetings.

To accurately record all contributions, Nextflow now supports a contributors array in the manifest section of nextflow.config. Fill it in, and use the two scripts /software/treeoflife/bin/generate_cff_from_manifest.py and /software/treeoflife/bin/generate_rocrate_from_manifest.py to update CITATION.cff and ro-crate-metadata.json accordingly.

When reviewing a release pull-request, check that all files are synchronised.

Minimum inputs

Pipelines can accept as many input files as you like, but it should be possible to run with as few as possible.

For example, pipelines should auto-generate missing reference files, where possible. So given a reference genome Fasta file a pipeline would build the reference index files. The pipeline should also be able to optionally accept the reference index files in this case, if available.

Git branches

The latest stable release should be on the main branch. No additional changes should be pushed to main after each release.

The main development code should be kept in a branch called dev. The sanger-tol dev branch should be set as the default branch up until the first release.

For minor bugfixes a patch branch may be used and merged directly into main, leaving dev for continued development work.

The TEMPLATE branch should only contain vanilla nf-core template code. It is used for automated synchronisation of template updates.

If reviewing a pipeline on an older nf-core version, double-check the occurrences of the keyword master in the repository.

GitHub Actions workflows (under .github/) may still contain master but always with main as well. The only exceptions are those three files that we ask to update during the pipeline creation.
modules.json can refer to master because that's the branch used in nf-core/modules.
nextflow.config and nextflow_schema.json also refer to master as the branch used in nf-core/configs.
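A reviewer could list the remaining occurrences of master with a short script along these lines. It is a sketch built on assumptions: the allow-list holds the three root-level files mentioned above, "with main as well" is interpreted as main appearing on the same line of a file under .github/, and the function name is made up.

```python
import os
import re

# Root-level files that may legitimately refer to master (see the list above).
ALLOWED_FILES = {"modules.json", "nextflow.config", "nextflow_schema.json"}
MASTER = re.compile(r"\bmaster\b")
MAIN = re.compile(r"\bmain\b")


def find_master(repo_root):
    """Return sorted (relative path, line number, line) tuples worth reviewing."""
    hits = []
    for dirpath, dirnames, filenames in os.walk(repo_root):
        # Do not descend into the git metadata directory.
        dirnames[:] = [d for d in dirnames if d != ".git"]
        for filename in filenames:
            path = os.path.join(dirpath, filename)
            rel = os.path.relpath(path, repo_root)
            if rel in ALLOWED_FILES:
                continue
            in_github = rel.split(os.sep)[0] == ".github"
            try:
                with open(path, encoding="utf-8") as handle:
                    for lineno, line in enumerate(handle, 1):
                        if not MASTER.search(line):
                            continue
                        # Assumption: under .github/, master next to main is fine.
                        if in_github and MAIN.search(line):
                            continue
                        hits.append((rel, lineno, line.strip()))
            except (UnicodeDecodeError, OSError):
                continue  # skip binary or unreadable files
    return sorted(hits)
```

Calling find_master(".") from the root of a checkout returns the lines to inspect by hand. An empty list does not replace reading the three allowed files themselves.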

Use Bioconda

All pipeline software should be packaged using bioconda. Bioconda packages are automatically available as Docker and Singularity images through biocontainers.

File formats

Pipelines should work with best practice modern file formats, as accepted by the community.

Where possible, genomics pipelines should generate CRAM alignment files by default, but have a --bam option to generate BAM outputs if required by the user.

DOIs

Pipelines should have digital object identifiers (DOIs) for easy referencing in literature.

Typically each release should have a DOI generated by Zenodo. This can be automated through linkage with the GitHub repository.

Cloud compatible

Pipelines should have explicit support for running in cloud environments. Pipelines created with the nf-core template come with all the code required to support this setup.

Publication credit

sanger-tol

Pipeline publications should acknowledge all developers, reviewers and anyone who has contributed to the finished product (even through verbal discussions and suggestions). This can be done by either listing people by name or as a collective.

nf-core

At a minimum, the nf-core community should be thanked in the acknowledgment section.

We would like to thank the nf-core community for developing the nf-core infrastructure and resources for Nextflow pipelines. A full list of nf-core community members is available at https://nf-co.re/community.

Optionally, the nf-core community can be included as a consortium co-author of the publication (example). This route is a good idea when a pipeline has used extensive existing nf-core pipeline infrastructure (e.g. modules) that were written by other members of the community not directly involved in the pipeline itself.

If any members of the nf-core community have provided significant input to the creation of the pipeline, please consider adding them as coauthors on the paper directly.

If in doubt, contact the core team for guidance.