sanger-tol/insdcdownload

Nextflow DSL2 pipeline to download assemblies from INSDC.

These pages are for an old version of the pipeline (v1.0.0). The latest stable release is v2.0.2

https://github.com/sanger-tol/insdcdownload

Introduction

This document describes the output produced by the pipeline.

The directories listed below will be created in the results directory after the pipeline has finished. All paths are relative to the top-level results directory.

The directories comply with Tree of Life's canonical directory structure.

Pipeline overview

The pipeline is built using Nextflow and processes data using the following steps:

- Assembly files - Assembly files, either straight from the NCBI FTP, or indices built on them
- Primary analysis files - Files corresponding to analyses run (by the NCBI) on the original assembly, e.g. repeat masking
- Pipeline information - Report metrics generated during the workflow execution

Assembly files

Here are the files you can expect in the assembly/ sub-directory.

assembly
└── release
    └── gfLaeSulp1.1
        └── insdc
            ├── GCA_927399515.1.assembly_report.txt
            ├── GCA_927399515.1.assembly_stats.txt
            ├── GCA_927399515.1.chrom_sizes
            ├── GCA_927399515.1.fasta.dict
            ├── GCA_927399515.1.fasta.gz
            ├── GCA_927399515.1.fasta.gz.fai
            └── GCA_927399515.1.fasta.gz.gzi

The directory structure includes the assembly name, e.g. gfLaeSulp1.1, and all files are named after the assembly accession, e.g. GCA_927399515.1.
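The naming rule can be sketched in a few lines of Python. This is only an illustration: the function name `expected_assembly_files` is hypothetical and not part of the pipeline, and the suffix list is copied from the listing above.

```python
from pathlib import PurePosixPath

# File suffixes copied from the directory listing above.
ASSEMBLY_SUFFIXES = [
    "assembly_report.txt",
    "assembly_stats.txt",
    "chrom_sizes",
    "fasta.dict",
    "fasta.gz",
    "fasta.gz.fai",
    "fasta.gz.gzi",
]


def expected_assembly_files(assembly_name, accession):
    """Return the expected file paths, relative to the results directory.

    Hypothetical helper: the assembly name selects the directory and the
    accession prefixes every file name.
    """
    base = PurePosixPath("assembly") / "release" / assembly_name / "insdc"
    return [str(base / f"{accession}.{suffix}") for suffix in ASSEMBLY_SUFFIXES]
```

Calling `expected_assembly_files("gfLaeSulp1.1", "GCA_927399515.1")` yields the seven paths shown in the tree above, which can be handy for checking that a run produced everything.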

- GCA_*.assembly_report.txt and GCA_*.assembly_stats.txt: Report and statistics files, straight from the NCBI FTP
- GCA_*.fasta.gz: Unmasked assembly in Fasta format, compressed with bgzip (whose index is GCA_*.fasta.gz.gzi)
- GCA_*.fasta.gz.fai: samtools faidx index, which allows accessing any region of the assembly in constant time
- GCA_*.fasta.dict: samtools dict index, which allows identifying a sequence by its MD5 checksum
- GCA_*.chrom_sizes: Tabular file with the size of all sequences in the assembly. Typically used to build "big" files (bigBed, etc.).
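The constant-time access offered by a .fai index comes from simple arithmetic on its five tab-separated columns (sequence name, length, offset of the first base, bases per line, bytes per line). The following is a minimal sketch of that arithmetic, assuming an uncompressed Fasta file for simplicity; for a bgzip-compressed file the offsets refer to the uncompressed stream, which the .gzi index maps back to compressed blocks. The function names are illustrative and not part of samtools or the pipeline.

```python
def parse_fai(lines):
    """Parse .fai lines into {name: (length, offset, linebases, linewidth)}."""
    index = {}
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue
        name, length, offset, linebases, linewidth = line.split("\t")[:5]
        index[name] = (int(length), int(offset), int(linebases), int(linewidth))
    return index


def byte_offset(index, name, pos):
    """Return the offset, in the uncompressed Fasta, of 0-based position `pos`."""
    length, offset, linebases, linewidth = index[name]
    if not 0 <= pos < length:
        raise ValueError(f"position {pos} is outside sequence {name!r}")
    # Skip the full lines before `pos` (each costs `linewidth` bytes, newline
    # included), then move `pos % linebases` bytes into the current line.
    return offset + (pos // linebases) * linewidth + pos % linebases
```

Because the offset is computed directly, no scan of the file is needed: a reader can seek straight to the requested base regardless of how large the assembly is.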

Primary analysis files

Here are the files you can expect in the analysis/ sub-directory.

analysis
└── gfLaeSulp1.1
    └── repeats
        └── ncbi
            ├── GCA_927399515.1.masked.ncbi.bed.gz
            ├── GCA_927399515.1.masked.ncbi.bed.gz.gzi
            ├── GCA_927399515.1.masked.ncbi.bed.gz.tbi
            ├── GCA_927399515.1.masked.ncbi.fasta.dict
            ├── GCA_927399515.1.masked.ncbi.fasta.gz
            ├── GCA_927399515.1.masked.ncbi.fasta.gz.fai
            └── GCA_927399515.1.masked.ncbi.fasta.gz.gzi

They all correspond to the repeat-masking analysis run by the NCBI themselves. As in the assembly/ sub-directory, the directory structure includes the assembly name, e.g. gfLaeSulp1.1, and all files are named after the assembly accession, e.g. GCA_927399515.1.

- GCA_*.masked.ncbi.fasta.gz: Masked assembly in Fasta format, compressed with bgzip (whose index is GCA_*.masked.ncbi.fasta.gz.gzi)
- GCA_*.masked.ncbi.fasta.gz.fai: samtools faidx index, which allows accessing any region of the assembly in constant time
- GCA_*.masked.ncbi.fasta.dict: samtools dict index, which allows identifying a sequence by its MD5 checksum
- GCA_*.masked.ncbi.bed.gz: BED file with the coordinates of the regions masked by the NCBI pipeline, with accompanying bgzip and tabix indices (.gzi and .tbi respectively)
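As an example of using the BED file, the sketch below sums the masked bases per sequence. It relies on the standard BED convention (0-based start, exclusive end, so an interval spans `end - start` bases) and assumes the intervals do not overlap. The function names are hypothetical. Since bgzip output is valid gzip, Python's `gzip` module can read the file directly.

```python
import gzip


def masked_bases(bed_lines):
    """Sum the interval lengths per sequence from an iterable of BED lines."""
    totals = {}
    for line in bed_lines:
        line = line.rstrip("\n")
        # Skip blank lines and the optional BED header lines.
        if not line.strip() or line.startswith(("#", "track", "browser")):
            continue
        chrom, start, end = line.split("\t")[:3]
        totals[chrom] = totals.get(chrom, 0) + int(end) - int(start)
    return totals


def masked_bases_from_file(path):
    """Same as masked_bases, reading a bgzip- or gzip-compressed BED file."""
    with gzip.open(path, "rt") as handle:
        return masked_bases(handle)
```

Dividing each total by the sequence length from the GCA_*.chrom_sizes file would give the masked fraction per sequence.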

Pipeline information

pipeline_info/
- Reports generated by Nextflow: execution_report.html, execution_timeline.html, execution_trace.txt and pipeline_dag.dot/pipeline_dag.svg.
- Reports generated by the pipeline: pipeline_report.html, pipeline_report.txt and software_versions.yml. The pipeline_report* files will only be present if the --email / --email_on_fail parameters are used when running the pipeline.

Nextflow can generate several reports about the running and execution of the pipeline. These reports help you troubleshoot errors, and also provide other information such as launch commands, run times and resource usage.
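For a quick look at a run, the trace file can be summarised with a few lines of Python. This is a rough sketch that assumes execution_trace.txt is tab-separated with a header row containing a `status` column, as in Nextflow's default trace fields; a pipeline configuration may select different columns. The function name is hypothetical.

```python
import csv
from collections import Counter


def count_task_statuses(trace_lines):
    """Count tasks per status from the lines of a tab-separated trace file."""
    reader = csv.DictReader(trace_lines, delimiter="\t")
    return Counter(row["status"] for row in reader)


def count_task_statuses_from_file(path):
    """Same as count_task_statuses, reading the trace file at `path`."""
    with open(path, newline="") as handle:
        return count_task_statuses(handle)
```

Any status other than a completed or cached one points at tasks worth inspecting in execution_report.html.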