Introduction
This document describes the output produced by the pipeline.
The directories listed below will be created in a directory based on the --outdir command-line parameter and the outdir column of the samplesheet after the pipeline has finished.
All paths are relative to the top-level results directory.
The directories comply with Tree of Life's canonical directory structure.
Pipeline overview
The pipeline is built using Nextflow and processes data using the following steps:
- Gene annotation files - Annotation files, either straight from the Ensembl FTP, or indices built on them
- Pipeline information - Report metrics generated during the workflow execution
All data files are compressed (and indexed) with bgzip.
All Fasta files are indexed with samtools faidx, which allows accessing any region of the assembly in constant time, and samtools dict, which allows identifying a sequence by its MD5 checksum.
All GFF3 files are indexed with tabix in both TBI and CSI modes, unless the sequences are too large for the index format.
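To illustrate why a faidx index gives constant-time access, here is a minimal Python sketch. It assumes the standard five-column .fai layout (name, length, offset, bases per line, bytes per line); the function names are hypothetical and not part of the pipeline. For bgzip-compressed files the computed offset refers to the uncompressed data, and the .gzi index is what maps it to a position in the compressed file.

```python
from typing import NamedTuple


class FaiRecord(NamedTuple):
    name: str
    length: int      # number of bases in the sequence
    offset: int      # byte offset of the first base (uncompressed data)
    line_bases: int  # bases per line
    line_width: int  # bytes per line, including the line terminator


def parse_fai_line(line: str) -> FaiRecord:
    """Parse one tab-separated line of a .fai index (hypothetical helper)."""
    name, length, offset, line_bases, line_width = line.rstrip("\n").split("\t")[:5]
    return FaiRecord(name, int(length), int(offset), int(line_bases), int(line_width))


def byte_offset(rec: FaiRecord, pos: int) -> int:
    """Byte offset of the 0-based position `pos`, computed with plain arithmetic."""
    if not 0 <= pos < rec.length:
        raise ValueError(f"position {pos} is outside sequence {rec.name}")
    full_lines, within_line = divmod(pos, rec.line_bases)
    return rec.offset + full_lines * rec.line_width + within_line


# Made-up index line: 100 bases, 60 per line, first base at byte 6.
record = parse_fai_line("seq1\t100\t6\t60\t61\n")
print(byte_offset(record, 60))
```

No scanning of the file is needed: the offset depends only on the index line, which is why the lookup cost does not grow with the size of the assembly.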
Gene annotation files
Here are the files you can expect in the gene/ sub-directory.
gene
└── ensembl
    └── 2022_02
        ├── GCA_907164925.1.ensembl.2022_02.cdna.fa.gz
        ├── GCA_907164925.1.ensembl.2022_02.cdna.fa.gz.dict
        ├── GCA_907164925.1.ensembl.2022_02.cdna.fa.gz.fai
        ├── GCA_907164925.1.ensembl.2022_02.cdna.fa.gz.gzi
        ├── GCA_907164925.1.ensembl.2022_02.cdna.fa.gz.sizes
        ├── GCA_907164925.1.ensembl.2022_02.cds.fa.gz
        ├── GCA_907164925.1.ensembl.2022_02.cds.fa.gz.dict
        ├── GCA_907164925.1.ensembl.2022_02.cds.fa.gz.fai
        ├── GCA_907164925.1.ensembl.2022_02.cds.fa.gz.gzi
        ├── GCA_907164925.1.ensembl.2022_02.cds.fa.gz.sizes
        ├── GCA_907164925.1.ensembl.2022_02.gff3.gz
        ├── GCA_907164925.1.ensembl.2022_02.gff3.gz.csi
        ├── GCA_907164925.1.ensembl.2022_02.gff3.gz.gzi
        ├── GCA_907164925.1.ensembl.2022_02.gff3.gz.tbi
        ├── GCA_907164925.1.ensembl.2022_02.pep.fa.gz
        ├── GCA_907164925.1.ensembl.2022_02.pep.fa.gz.dict
        ├── GCA_907164925.1.ensembl.2022_02.pep.fa.gz.fai
        ├── GCA_907164925.1.ensembl.2022_02.pep.fa.gz.gzi
        └── GCA_907164925.1.ensembl.2022_02.pep.fa.gz.sizes
All files are named after:
- the assembly accession, e.g. GCA_907164925.1;
- the annotation method, e.g. ensembl;
- the annotation date, e.g. 2022_02.
This information is also in the directory names to allow multiple annotations to be loaded.
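The naming scheme can be taken apart programmatically. The sketch below is an illustration only: the regular expression and the function name are assumptions inferred from the example file names in this document, not something the pipeline provides.

```python
import re
from typing import Dict, Optional

# Assumed pattern: <accession>.<method>.<date>.<remaining suffix>
FILENAME_PATTERN = re.compile(
    r"(?P<accession>GC[AF]_\d+\.\d+)"  # assembly accession, e.g. GCA_907164925.1
    r"\.(?P<method>[^.]+)"             # annotation method, e.g. ensembl
    r"\.(?P<date>\d{4}_\d{2})"         # annotation date, e.g. 2022_02
    r"\.(?P<suffix>.+)"                # everything else, e.g. cdna.fa.gz
)


def parse_annotation_filename(filename: str) -> Optional[Dict[str, str]]:
    """Return the name components, or None if the name does not fit the pattern."""
    match = FILENAME_PATTERN.fullmatch(filename)
    return match.groupdict() if match else None


print(parse_annotation_filename("GCA_907164925.1.ensembl.2022_02.cdna.fa.gz"))
```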
The .seq_length.tsv files are tabular files analogous to the common chrom.sizes files. They contain the sequence names and their lengths.
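A file of this kind can be loaded with a few lines of Python. This is a sketch that assumes two tab-separated columns (sequence name, then length) and no header row, as in chrom.sizes; the function name and the sample values are made up for illustration.

```python
import io
from typing import Dict, TextIO


def read_sequence_lengths(handle: TextIO) -> Dict[str, int]:
    """Map each sequence name to its length (assumes name<TAB>length per line)."""
    lengths: Dict[str, int] = {}
    for line in handle:
        line = line.rstrip("\n")
        if not line:
            continue  # tolerate blank lines
        name, length = line.split("\t")[:2]
        lengths[name] = int(length)
    return lengths


# Made-up content standing in for a real file.
example = io.StringIO("LG1\t1000\nLG2\t250\n")
print(read_sequence_lengths(example))
```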
The following documentation is copied from Ensembl's FTP.
Fasta files
Ensembl provide gene sequences in FASTA format in three files. The 'cdna' file contains transcript sequences for all types of gene (including, for example, pseudogenes and RNA genes). The 'cds' file contains the DNA sequences of the coding regions of protein-coding genes. The 'pep' file contains the amino acid sequences of protein-coding genes.
The headers in the 'cdna' FASTA files have the format:
><transcript_stable_id> <seq_type> <assembly_name>:<seq_name>:<start>:<end>:<strand> gene:<gene_stable_id> gene_biotype:<gene_biotype> transcript_biotype:<transcript_biotype> [gene_symbol:<gene_symbol>] [description:<description>]
Example 'cdna' header:
>ENSZVIT00000000002.1 cdna UG_Zviv_1:LG1:3600:22235:-1 gene:ENSZVIG00000000002.1 gene_biotype:protein_coding transcript_biotype:protein_coding
The headers in the 'cds' FASTA files have the format:
><transcript_stable_id> <seq_type> <assembly_name>:<seq_name>:<coding_start>:<coding_end>:<strand> gene:<gene_stable_id> gene_biotype:<gene_biotype> transcript_biotype:<transcript_biotype> [gene_symbol:<gene_symbol>] [description:<description>]
Example 'cds' header:
>ENSZVIT00000000002.1 cds UG_Zviv_1:LG1:5289:19862:-1 gene:ENSZVIG00000000002.1 gene_biotype:protein_coding transcript_biotype:protein_coding
The headers in the 'pep' FASTA files have the format:
><protein_stable_id> <seq_type> <assembly_name>:<seq_name>:<coding_start>:<coding_end>:<strand> gene:<gene_stable_id> transcript:<transcript_stable_id> gene_biotype:<gene_biotype> transcript_biotype:<transcript_biotype> [gene_symbol:<gene_symbol>] [description:<description>]
Example 'pep' header:
>ENSZVIP00000000002.1 pep UG_Zviv_1:LG1:5289:19862:-1 gene:ENSZVIG00000000002.1 transcript:ENSZVIT00000000002.1 gene_biotype:protein_coding transcript_biotype:protein_coding
Stable IDs for genes, transcripts, and proteins include a version suffix. Gene symbols and descriptions are not available for all genes.
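The header formats above can be split into fields as follows. This is a minimal sketch, not an official Ensembl parser: the function name and the keys of the returned dictionary are assumptions, and it treats everything after "description:" as free text because descriptions may contain spaces.

```python
from typing import Any, Dict


def parse_ensembl_header(header: str) -> Dict[str, Any]:
    """Split a cdna/cds/pep FASTA header into its fields (hypothetical helper)."""
    text = header.rstrip("\n").lstrip(">")
    # The description is optional and comes last; cut it off before tokenising.
    text, _, description = text.partition(" description:")
    tokens = text.split()
    stable_id, seq_type, location = tokens[:3]
    # location is <assembly_name>:<seq_name>:<start>:<end>:<strand>
    assembly, rest = location.split(":", 1)
    seq_name, start, end, strand = rest.rsplit(":", 3)
    record: Dict[str, Any] = {
        "id": stable_id,
        "seq_type": seq_type,
        "assembly": assembly,
        "seq_name": seq_name,
        "start": int(start),
        "end": int(end),
        "strand": int(strand),
    }
    # Remaining tokens are key:value pairs such as gene:<gene_stable_id>.
    for token in tokens[3:]:
        key, _, value = token.partition(":")
        record[key] = value
    if description:
        record["description"] = description
    return record


example = (
    ">ENSZVIT00000000002.1 cdna UG_Zviv_1:LG1:3600:22235:-1 "
    "gene:ENSZVIG00000000002.1 gene_biotype:protein_coding "
    "transcript_biotype:protein_coding"
)
print(parse_ensembl_header(example)["gene"])
```

Optional fields (gene_symbol, description) simply do not appear in the result when the header lacks them.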
GFF3 file
A GFF3 (specification) file is also provided. GFF3 files are validated using GenomeTools.
The 'type' of gene features is:
- gene for protein-coding genes
- ncRNA_gene for RNA genes
- pseudogene for pseudogenes
The 'type' of transcript features is:
- mRNA for protein-coding transcripts
- a specific type of RNA transcript such as snoRNA or lnc_RNA
- pseudogenic_transcript for pseudogenes
All transcripts are linked to exon features.
Protein-coding transcripts are linked to CDS, five_prime_UTR, and three_prime_UTR features.
Attributes for feature types: (italics indicate data which is not available for all features)
- region types:
  - ID: Unique identifier, format <region_type>:<region_name>
  - Alias: A comma-separated list of aliases, usually including the INSDC accession
  - _Is_circular_: Flag to indicate circular regions
- gene types:
  - ID: Unique identifier, format gene:<gene_stable_id>
  - biotype: Ensembl biotype, e.g. protein_coding, pseudogene
  - gene_id: Ensembl gene stable ID
  - version: Ensembl gene version
  - Name: Gene name
  - description: Gene description
- transcript types:
  - ID: Unique identifier, format transcript:<transcript_stable_id>
  - Parent: Gene identifier, format gene:<gene_stable_id>
  - biotype: Ensembl biotype, e.g. protein_coding, pseudogene
  - transcript_id: Ensembl transcript stable ID
  - version: Ensembl transcript version
  - Note: If the transcript sequence has been edited (i.e. differs from the genomic sequence), the edits are described in a note.
- exon
  - Parent: Transcript identifier, format transcript:<transcript_stable_id>
  - exon_id: Ensembl exon stable ID
  - version: Ensembl exon version
  - constitutive: Flag to indicate if exon is present in all transcripts
  - rank: Integer that shows the 5'->3' ordering of exons
- CDS
  - ID: Unique identifier, format CDS:<protein_stable_id>
  - Parent: Transcript identifier, format transcript:<transcript_stable_id>
  - protein_id: Ensembl protein stable ID
  - version: Ensembl protein version
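The attributes listed above live in the ninth column of each GFF3 line, as semicolon-separated key=value pairs. Here is a minimal sketch of reading them, following the GFF3 convention that multiple values are comma-separated and reserved characters are percent-encoded. The function name is hypothetical and the attribute string in the example is made up for illustration.

```python
from typing import Dict, List
from urllib.parse import unquote


def parse_gff3_attributes(column9: str) -> Dict[str, List[str]]:
    """Turn a GFF3 attribute column into a mapping of key -> list of values."""
    attributes: Dict[str, List[str]] = {}
    for field in column9.strip().split(";"):
        if not field:
            continue  # skip empty fields, e.g. after a trailing semicolon
        key, _, value = field.partition("=")
        # Split on commas first, then decode, so an encoded comma (%2C)
        # stays inside its value.
        attributes[unquote(key)] = [unquote(item) for item in value.split(",")]
    return attributes


# Illustrative attribute string for a transcript feature.
example = (
    "ID=transcript:ENSZVIT00000000002;Parent=gene:ENSZVIG00000000002;"
    "biotype=protein_coding;transcript_id=ENSZVIT00000000002;version=1"
)
print(parse_gff3_attributes(example)["Parent"])
```

Every value is returned as a list so that single-valued attributes such as ID and multi-valued ones such as Alias can be handled the same way.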
Pipeline information
pipeline_info/ensemblgenedownload/
- Reports generated by Nextflow: execution_report.html, execution_timeline.html, execution_trace.txt and pipeline_dag.dot/pipeline_dag.svg.
- Reports generated by the pipeline: pipeline_report.html, pipeline_report.txt and software_versions.yml. The pipeline_report* files will only be present if the --email / --email_on_fail parameters are used when running the pipeline.
- Reformatted samplesheet files used as input to the pipeline: samplesheet.valid.csv.
Nextflow provides excellent functionality for generating various reports relevant to the running and execution of the pipeline. This will allow you to troubleshoot errors with the running of the pipeline, and also provide you with other information such as launch commands, run times and resource usage.
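As an example of using these reports for troubleshooting, the sketch below lists the tasks that did not finish successfully. It assumes that execution_trace.txt is tab-separated with a header row containing name and status columns, as in Nextflow's default trace layout; the function name and the sample rows are made up for illustration.

```python
import csv
import io
from typing import Dict, Iterable, List


def unsuccessful_tasks(lines: Iterable[str]) -> List[Dict[str, str]]:
    """Return the trace rows whose status is FAILED or ABORTED."""
    reader = csv.DictReader(lines, delimiter="\t")
    return [row for row in reader if row.get("status") in {"FAILED", "ABORTED"}]


# Made-up trace content standing in for a real execution_trace.txt.
sample = io.StringIO(
    "task_id\tname\tstatus\texit\n"
    "1\tTASK_A\tCOMPLETED\t0\n"
    "2\tTASK_B\tFAILED\t1\n"
)
for task in unsuccessful_tasks(sample):
    print(task["name"], task["exit"])
```

With a real run, an open file handle on the trace file can be passed in place of the in-memory sample.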