Principles
All Genome After-Party pipelines organise their outputs in a consistent and scalable manner.
The main principles are that:
- Files are uniquely named across the entire Genome After-Party and could be mixed into the same directory without clashing.
- To facilitate this, filenames include all necessary identifiers such as assembly, specimen, or sequencing run.
- These identifiers are also used to name the output directories, with each identifier naming a different directory level.
- Analyses that are implemented in multiple pipelines always have the same output name and path.
- File names are as self-explanatory as possible.
- File and directory naming supports topping-up, e.g. adding a new specimen or a new run.
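The first principle (file names never clash, even when mixed into one directory) can be checked mechanically. The following is a minimal sketch, assuming the outputs are available as a plain list of relative paths; the helper name `find_name_clashes` is hypothetical and not part of any pipeline.

```python
from collections import defaultdict
from pathlib import PurePosixPath


def find_name_clashes(paths):
    """Return {file name: sorted paths} for every name used by more than one path."""
    by_name = defaultdict(set)
    for path in paths:
        by_name[PurePosixPath(path).name].add(path)
    return {name: sorted(group) for name, group in by_name.items() if len(group) > 1}


# Paths taken from the examples in this document.
outputs = [
    "read_mapping/hic/icLepMacu1/ERR9248445/GCA_936432065.2.hic.icLepMacu1.ERR9248445.minimap2.cram",
    "read_mapping/hic/icLepMacu1/merged.1/GCA_936432065.2.hic.icLepMacu1.merged.1.minimap2.cram",
    "busco/insecta_odb12/GCA_936432065.2.insecta_odb12.full_table.tsv",
]
clashes = find_name_clashes(outputs)
```

An empty result means the files could be flattened into a single directory without overwriting each other.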
Additionally:
- All text files that can be queried by coordinates (e.g. Fasta, BED, bedGraph, VCF, some TSV) are compressed with bgzip and indexed with tabix in both .tbi and .csi formats.
- All other text files are compressed with gzip if they typically exceed 10 MB.
- Sequence alignments are in CRAM format (version 3.0) with embedded references, ensuring the files can be read widely and without having to pass the assembly Fasta file as a parameter, and are all indexed with samtools index in .crai format.
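The indexing rules above imply which index files should sit next to each output. The sketch below derives the expected index names from a file name. It assumes the usual convention that tabix and samtools index append the index suffix to the full file name, and it uses an illustrative, incomplete list of suffixes to recognise coordinate-queryable files; the function name `expected_index_files` is hypothetical.

```python
def expected_index_files(filename):
    """Return the index file names expected alongside `filename` (empty if none)."""
    # Assumption: these suffixes stand in for the coordinate-queryable text formats.
    bgzip_suffixes = (".bed.gz", ".bedGraph.gz", ".vcf.gz")
    if filename.endswith(".cram"):
        # CRAM alignments are indexed in .crai format.
        return [filename + ".crai"]
    if filename.endswith(bgzip_suffixes):
        # bgzip-compressed files are indexed in both .tbi and .csi formats.
        return [filename + ".tbi", filename + ".csi"]
    return []


indices = expected_index_files("GCA_936432065.2.GC.1k.bedGraph.gz")
```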
Here is the list of identifiers currently used to name outputs:
| Name | Description | Example value |
|---|---|---|
| assembly | Accession number of the assembly. Linked haploid assemblies may be referred to too. | GCA_936432065.2 (principal), GCA_936443135.2 (alternate) |
| type | Sequencing technology. One of pacbio, hic, illumina, ont, rna. | hic |
| run | Identifier of the sequencing run. Usually the accession number of the data in INSDC. | ERR9248445 (hic), ERR9284044 (pacbio) |
| specimen | Identifier of the specimen. Usually a ToLID. | icLepMacu1 |
| lineage | Name of the BUSCO lineage, including the _odb* suffix. | insecta_odb12 |
| ancestral_set | Name of the set of ancestral linkage groups. | Merian |
| # | Auto-incremented integer, starting from 1. Typically used to version merged datasets. | 1 |
Additionally, tool and software names may be added to the outputs for clarity, especially when different tools could be used, e.g. the aligner or variant-caller.
Below is the canonical structure that all Genome After-Party pipelines abide by.
Placeholders for identifiers are indicated with the ${...} syntax.
To keep the document lean, indices such as .tbi, .crai, etc, are omitted from the listings below.
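To make the ${...} placeholder syntax concrete, here is a minimal sketch that substitutes identifiers into a template. The `expand` helper is hypothetical, and the template string is assembled here for illustration. A regular expression is used instead of Python's `string.Template` because `#` is not a valid identifier for that class.

```python
import re

# Matches ${name}, where name is anything up to the closing brace (including "#").
PLACEHOLDER = re.compile(r"\$\{([^}]+)\}")


def expand(template, identifiers):
    """Replace every ${name} placeholder; raises KeyError if a value is missing."""
    return PLACEHOLDER.sub(lambda m: str(identifiers[m.group(1)]), template)


identifiers = {
    "assembly": "GCA_936432065.2",
    "type": "hic",
    "specimen": "icLepMacu1",
    "run": "ERR9248445",
    "aligner": "minimap2",
}
path = expand(
    "read_mapping/${type}/${specimen}/${run}/"
    "${assembly}.${type}.${specimen}.${run}.${aligner}.cram",
    identifiers,
)
print(path)
```

Failing loudly on a missing identifier is a deliberate choice in this sketch: a silently unexpanded placeholder would produce a file name that breaks the naming rules.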
Assembly
Assemblies are also made available in a standard directory by the INSDC download pipeline.
- assembly/
  - ${assembly}.(fa.gz|sizes|assembly_(report|stats).txt|header.sam|SOURCE)
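The listings in this document use a compact (a|b) notation for alternative suffixes. As an illustration of how to read it, the sketch below expands nested alternatives into concrete file names. The function names are hypothetical, and only plain nested groups are handled: the `?` and `*` characters that appear in some listings are treated as literal text.

```python
def expand_alternatives(pattern):
    """Expand nested (a|b) groups into the list of concrete strings."""
    results, pos = _parse_alternatives(pattern, 0)
    if pos != len(pattern):
        raise ValueError(f"unbalanced ')' at position {pos}")
    return results


def _parse_alternatives(pattern, pos):
    """Parse 'seq|seq|...' from pos; stop at ')' or at the end of the pattern."""
    results = []
    while True:
        branch, pos = _parse_sequence(pattern, pos)
        results.extend(branch)
        if pos < len(pattern) and pattern[pos] == "|":
            pos += 1  # move past the separator and parse the next branch
            continue
        return results, pos


def _parse_sequence(pattern, pos):
    """Parse literal characters and groups until '|', ')' or the end."""
    prefixes = [""]
    while pos < len(pattern) and pattern[pos] not in "|)":
        if pattern[pos] == "(":
            options, pos = _parse_alternatives(pattern, pos + 1)
            if pos >= len(pattern) or pattern[pos] != ")":
                raise ValueError("missing ')'")
            pos += 1  # skip the closing parenthesis
            prefixes = [p + o for p in prefixes for o in options]
        else:
            prefixes = [p + pattern[pos] for p in prefixes]
            pos += 1
    return prefixes, pos


names = expand_alternatives(
    "GCA_936432065.2.(fa.gz|sizes|assembly_(report|stats).txt|header.sam|SOURCE)"
)
```

For the assembly listing, this yields six names, from GCA_936432065.2.fa.gz to GCA_936432065.2.SOURCE.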
Read mapping
The following outputs mainly come from the read mapping pipeline. Read alignments may also be generated by the BlobToolKit and variant calling pipelines. Contact maps are currently generated by the genome note pipeline.
- read_mapping/
  - ${type}/${specimen}/
    - ${run}/
      - ${assembly}.${type}.${specimen}.${run}.${aligner}.(coverage.bedGraph.gz|cram)
      - qc/
        - ${type}.${specimen}.${run}.fastqc.(html|zip)
        - ${type}.${specimen}.${run}.filtered_fastqc.(html|zip) – optional
        - ${type}.${specimen}.${run}.hifi_trimmer.tar.gz – optional
        - ${assembly}.${type}.${specimen}.${run}.multiqc.html
      - stats/
        - ${assembly}.${type}.${specimen}.${run}.${aligner}.(flagstat|idxstats|stats.gz)
    - merged.${#}/
      - ${assembly}.${type}.${specimen}.merged.${#}.${aligner}.SOURCE
      - all like above but with merged.${#} instead of ${run} in the file names
- contact_maps/
  - ${specimen}/${run}/
    - ${assembly}.hic.${specimen}.${run}.(cool|mcool|pretext|pretext.png)
When alignments are produced across multiple runs,
merged.${#} is used as a pseudo run identifier,
where ${#} is automatically incremented from 1.
The SOURCE file then lists the runs that were included in the merged analysis.
Example:
read_mapping/hic/icLepMacu1/ERR9248445/GCA_936432065.2.hic.icLepMacu1.ERR9248445.minimap2.cram
read_mapping/hic/icLepMacu1/ERR9248445/qc/hic.icLepMacu1.ERR9248445.fastqc.html
read_mapping/hic/icLepMacu1/merged.1/GCA_936432065.2.hic.icLepMacu1.merged.1.minimap2.cram
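The merged.${#} pseudo run identifier can be derived from the directories already present for a specimen. This is a minimal sketch only: how the pipelines actually allocate the counter is not described here, so the code assumes the next value is one more than the highest existing one. The function name `next_merged_id` is hypothetical.

```python
import re

# Directory names of the form merged.<integer>.
MERGED = re.compile(r"^merged\.(\d+)$")


def next_merged_id(existing_run_dirs):
    """Return the next merged.N name given the directory names already present."""
    used = [int(m.group(1)) for name in existing_run_dirs if (m := MERGED.match(name))]
    # Assumption: numbering starts at 1 and continues after the highest value in use.
    return f"merged.{max(used, default=0) + 1}"


first = next_merged_id(["ERR9248445"])
later = next_merged_id(["ERR9248445", "merged.1", "merged.2"])
```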
Variant calling and analysis
The following outputs come from the variant calling and variant composition pipelines.
- variant_analysis/
  - ${type}/${specimen}/${run}/
    - ${assembly}.${type}.${specimen}.${run}.${aligner}.${caller}.(vcf|g.vcf).gz - repeated for as many callers as we use
    - composition/
      - ${assembly}.${type}.${specimen}.${run}.${aligner}.${caller}.(vcf|g.vcf).(frq|het|indel.hist|roh|sites.pi.gz|snpden)
    - qc/
      - ${assembly}.${type}.${specimen}.${run}.${aligner}.${caller}.(vcf|g.vcf).(plot-vcfstats.pdf|stats.visual_report.html)
    - stats/
      - ${assembly}.${type}.${specimen}.${run}.${aligner}.${caller}.(vcf|g.vcf).(stats.bcftools.txt.gz|plot-vcfstats.tar.gz)
The triplet (${type}, ${specimen}, ${run}) is expected to match files from the read_mapping/ folder,
including merged.${#} forms of ${run}.
Note: In practice we only envisage using PacBio data.
Example:
variant_analysis/pacbio/icLepMacu1/ERR9284044/GCA_936432065.2.pacbio.icLepMacu1.ERR9284044.minimap2.deepvariant.vcf.gz
variant_analysis/pacbio/icLepMacu1/ERR9284044/stats/GCA_936432065.2.pacbio.icLepMacu1.ERR9284044.minimap2.deepvariant.vcf.stats.bcftools.txt.gz
variant_analysis/pacbio/icLepMacu1/ERR9284044/composition/GCA_936432065.2.pacbio.icLepMacu1.ERR9284044.minimap2.deepvariant.vcf.sites.pi.gz
variant_analysis/pacbio/icLepMacu1/merged.1/GCA_936432065.2.pacbio.icLepMacu1.merged.1.minimap2.deepvariant.vcf.gz
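Because the (${type}, ${specimen}, ${run}) triplet is shared with read_mapping/, the alignment behind a variant file can be located from the file name alone. The sketch below parses a variant file name and rebuilds the expected CRAM path. The regular expression and the function name `matching_cram` are assumptions for illustration; in particular, the pattern assumes GCA_-style accessions and identifiers that contain no dots other than in merged.${#}.

```python
import re

VARIANT_NAME = re.compile(
    r"^(?P<assembly>GCA_\d+\.\d+)"      # assumption: GCA_ accession with version
    r"\.(?P<type>[^.]+)"
    r"\.(?P<specimen>[^.]+)"
    r"\.(?P<run>merged\.\d+|[^.]+)"     # real run or merged.N pseudo run
    r"\.(?P<aligner>[^.]+)"
    r"\.(?P<caller>[^.]+)"
    r"\.(?:g\.vcf|vcf)\.gz$"
)


def matching_cram(variant_file_name):
    """Return the read_mapping/ CRAM path expected for a variant file name."""
    match = VARIANT_NAME.match(variant_file_name)
    if match is None:
        raise ValueError(f"unrecognised variant file name: {variant_file_name}")
    f = match.groupdict()
    return (
        f"read_mapping/{f['type']}/{f['specimen']}/{f['run']}/"
        f"{f['assembly']}.{f['type']}.{f['specimen']}.{f['run']}.{f['aligner']}.cram"
    )


cram = matching_cram(
    "GCA_936432065.2.pacbio.icLepMacu1.merged.1.minimap2.deepvariant.vcf.gz"
)
```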
BUSCO analysis
The following outputs come from the BlobToolKit and genome note pipelines.
- busco/
  - ${lineage}/
    - ${assembly}.${lineage}.(full_table.tsv|missing_busco_list.tsv|(single_copy|multi_copy|fragmented)_busco_sequences.tar.gz|short_summary.(json|tsv|txt)|hmmer_output.tar.gz)
- ancestral_plots/
  - ${lineage}/${ancestral_set}/
    - ${assembly}.${lineage}.${ancestral_set}.buscopainter.(pdf|png)
Example:
busco/insecta_odb12/GCA_936432065.2.insecta_odb12.full_table.tsv
ancestral_plots/lepidoptera_odb10/Merian/GCA_936432065.2.lepidoptera_odb10.Merian.buscopainter.pdf
BlobToolKit
The following outputs specifically come from the BlobToolKit pipeline.
- blobtoolkit/
  - ${assembly}/
    - *.json.gz
  - plots/
    - ${assembly}.*.png
Example:
blobtoolkit/GCA_936432065.2/
blobtoolkit/plots/GCA_936432065.2.snail.png
Genome statistics and features
The following outputs are computed by the sequence composition and genome note pipelines, or downloaded by the Ensembl gene download and Ensembl repeat download pipelines.
- base_content/
  - k1/
    - ${assembly}.(mononuc.1k.tsv.gz|(A|C|G|T|N|(AT|GC)_skew|GC).1k.bedGraph.gz)
  - k2/
    - ${assembly}.(dinuc.1k.tsv.gz|(CpG|dinucShannon).1k.bedGraph.gz)
  - k3/
    - ${assembly}.(trinuc.1k.tsv.gz|trinucShannon.1k.bedGraph.gz)
  - k4/
    - ${assembly}.(tetranuc.1k.tsv.gz|tetranucShannon.1k.bedGraph.gz)
- genes/
  - ${source}/
    - ${assembly}.${source}.(gff3.gz|(cdna|cds|pep).fa.gz)
- genome_stats/
  - ${assembly}.gfastats.txt
  - ${type}/${specimen}/${run}/
    - merqury/
      - ${assembly}.${type}.${specimen}.${run}.(completeness.stats|qv|spectra-asm.*.png)
      - ${assembly}.${type}.${specimen}.${run}.${target_assembly}.(only.bed.gz|qv|spectra-cn.*.png)
    - genomescope/
      - ${assembly}.${type}.${specimen}.${run}.genomescope_((transformed_)?(linear|log)_plot.png|(model|summary).txt)
- repeats/
  - ${source}/
    - ${assembly}.${source}.(bed.gz|masked.fa.gz)
The triplet (${type}, ${specimen}, ${run}) is expected to match files from the read_mapping/ folder,
including merged.${#} forms of ${run}.
Note: In practice we only envisage using PacBio data.
Note: the list will significantly increase when full development of the sequence composition pipeline starts.
Example:
base_content/k1/GCA_936432065.2.GC.1k.bedGraph.gz
genes/ensembl.2023_05/GCA_936432065.2.ensembl.2023-05.gff3.gz
genome_stats/pacbio/icLepMacu1/ERR9284044/merqury/GCA_936432065.2.pacbio.icLepMacu1.ERR9284044.completeness.stats
genome_stats/pacbio/icLepMacu1/merged.1/merqury/GCA_936432065.2.pacbio.icLepMacu1.merged.1.GCA_936443135.2.qv
genome_stats/pacbio/icLepMacu1/merged.1/genomescope/GCA_936432065.2.pacbio.icLepMacu1.merged.1.genomescope_log_plot.png
repeats/ensembl/GCA_936432065.2.ensembl.bed.gz
Genome note
The following outputs specifically come from the genome note pipeline, which fills in a template genome note.
- gene/
  - ${source}/
    - ${assembly}.${source}.stats.csv
- genome_note/
  - ${assembly}.(csv|docx|md|xml|genome_note_(consistent|inconsistent).csv)