Principles

All Genome After-Party pipelines organise their outputs in a consistent and scalable manner.

The main principles are that:

  • Files are uniquely named across the entire Genome After-Party and could be mixed into the same directory without clashing.
  • To facilitate this, filenames include all necessary identifiers such as assembly, specimen, or sequencing run.
  • These identifiers are also used to name the output directories, with each identifier naming a different directory level.
  • Analyses that are implemented in multiple pipelines always have the same output name and path.
  • File names are as self-explanatory as possible.
  • File and directory naming supports topping-up, e.g. adding a new specimen or a new run.
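
The first principle can be checked mechanically. The sketch below is a minimal illustration, not part of any pipeline: `find_name_clashes` is a hypothetical helper that groups output paths by base name and reports any name that appears under more than one directory.

```python
from collections import defaultdict
from pathlib import PurePosixPath


def find_name_clashes(paths):
    """Return {basename: [paths...]} for every basename used by more than one path."""
    by_name = defaultdict(list)
    for path in paths:
        by_name[PurePosixPath(path).name].append(path)
    return {name: found for name, found in by_name.items() if len(found) > 1}


# Two outputs that live in different directories but could share one.
outputs = [
    "read_mapping/hic/icLepMacu1/ERR9248445/GCA_936432065.2.hic.icLepMacu1.ERR9248445.minimap2.cram",
    "read_mapping/hic/icLepMacu1/merged.1/GCA_936432065.2.hic.icLepMacu1.merged.1.minimap2.cram",
]
clashes = find_name_clashes(outputs)  # empty dict: the file names are already unique
```

Because every identifier is repeated in the file name, flattening the directory tree should never produce a clash.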

Additionally:

  • All text files that can be queried by coordinates (e.g. Fasta, BED, bedGraph, VCF, some TSV) are compressed with bgzip and indexed with tabix in both .tbi and .csi formats.
  • All other text files are compressed with gzip if they typically exceed 10 MB.
  • Sequence alignments are in CRAM format (version 3.0) with embedded references, so that the files can be read widely without passing the assembly Fasta file as a parameter. They are all indexed with samtools index in .crai format.
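
The compression rules for text files can be summarised as a small decision function. This is a sketch under stated assumptions, not the pipelines' actual logic: the extension set, the decimal reading of "10 MB", the `queryable_tsv` flag (since only some TSV files qualify) and the use of a single file's size as a stand-in for "typically exceed" are all illustrative choices.

```python
TEN_MB = 10 * 1000 * 1000  # assumption: "10 MB" read as decimal megabytes

# Assumption: extensions treated as coordinate-queryable text formats.
# TSV is handled separately because only some TSV files qualify.
COORDINATE_EXTENSIONS = {".fa", ".fasta", ".bed", ".bedgraph", ".vcf"}


def compression_for(filename, size_bytes, queryable_tsv=False):
    """Return 'bgzip+tabix', 'gzip' or 'none' for an uncompressed text file."""
    lower = filename.lower()
    suffix = lower[lower.rfind("."):] if "." in lower else ""
    if suffix in COORDINATE_EXTENSIONS or (suffix == ".tsv" and queryable_tsv):
        return "bgzip+tabix"  # indexed in both .tbi and .csi formats
    if size_bytes > TEN_MB:
        return "gzip"
    return "none"


policy = compression_for("GCA_936432065.2.GC.1k.bedGraph", 2_000_000)
```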

Here is the list of identifiers currently used to name outputs:

| Name | Description | Example value |
|---|---|---|
| assembly | Accession number of the assembly. Linked haploid assemblies may also be referenced. | GCA_936432065.2 (principal), GCA_936443135.2 (alternate) |
| type | Sequencing technology. One of pacbio, hic, illumina, ont, rna. | hic |
| run | Identifier of the sequencing run. Usually the accession number of the data in INSDC. | ERR9248445 (hic), ERR9284044 (pacbio) |
| specimen | Identifier of the specimen. Usually a ToLID. | icLepMacu1 |
| lineage | Name of the BUSCO lineage, including the _odb* suffix. | insecta_odb12 |
| ancestral_set | Name of the set of ancestral linkage groups. | Merian |
| # | Auto-incremented integer, starting from 1. Typically used to version merged datasets. | 1 |

Additionally, tool and software names may be added to the outputs for clarity, especially when different tools could be used, e.g. the aligner or variant-caller.

Below is the canonical structure that all Genome After-Party pipelines abide by. Placeholders for identifiers are indicated with the ${...} syntax, and parentheses with | separate alternative endings of a file name. To keep the document lean, index files such as .tbi and .crai are omitted from the listings below.

Assembly

The INSDC download pipeline also makes assemblies available in a standard directory.

  • assembly/
    • ${assembly}.(fa.gz|sizes|assembly_(report|stats).txt|header.sam|SOURCE)
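
The listings in this document pack several file names into one line using nested `(a|b)` groups. The sketch below expands such a line into concrete names. It is a hypothetical helper written for illustration: it only understands nested alternation, so the `?` and `*` used in some later listings are treated as literal characters.

```python
def expand(pattern):
    """Expand nested (a|b|c) groups into the list of concrete strings."""
    results, pos = _parse_alternatives(pattern, 0)
    if pos != len(pattern):
        raise ValueError(f"unbalanced ')' at position {pos}")
    return results


def _parse_alternatives(pattern, pos):
    """Parse 'seq|seq|...' until ')' or the end; return (strings, next position)."""
    alternatives = []
    current = [""]  # expansions of the sequence currently being read
    while pos < len(pattern) and pattern[pos] != ")":
        char = pattern[pos]
        if char == "(":
            inner, pos = _parse_alternatives(pattern, pos + 1)
            if pos >= len(pattern):
                raise ValueError("missing ')'")
            pos += 1  # skip the closing ')'
            current = [head + tail for head in current for tail in inner]
        elif char == "|":
            alternatives.extend(current)
            current = [""]
            pos += 1
        else:
            current = [head + char for head in current]
            pos += 1
    alternatives.extend(current)
    return alternatives, pos


listing = "${assembly}.(fa.gz|sizes|assembly_(report|stats).txt|header.sam|SOURCE)"
names = [name.replace("${assembly}", "GCA_936432065.2") for name in expand(listing)]
# names[0] is "GCA_936432065.2.fa.gz"; six names in total
```

Plain string replacement is used for the placeholder rather than `string.Template`, because `${#}` is not a valid `Template` identifier.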

Read mapping

The following outputs mainly come from the read mapping pipeline. Read alignments may also be generated by the BlobToolKit and variant calling pipelines. Contact maps are currently generated by the genome note pipeline.

  • read_mapping/
    • ${type}/
      • ${specimen}/
        • ${run}/
          • ${assembly}.${type}.${specimen}.${run}.${aligner}.(coverage.bedGraph.gz|cram)
          • qc/
            • ${type}.${specimen}.${run}.fastqc.(html|zip)
            • ${type}.${specimen}.${run}.filtered_fastqc.(html|zip) – optional
            • ${type}.${specimen}.${run}.hifi_trimmer.tar.gz – optional
            • ${assembly}.${type}.${specimen}.${run}.multiqc.html
          • stats/
            • ${assembly}.${type}.${specimen}.${run}.${aligner}.(flagstat|idxstats|stats.gz)
        • merged.${#}/
          • ${assembly}.${type}.${specimen}.merged.${#}.${aligner}.SOURCE
          • all other files as above, with merged.${#} replacing ${run} in the file names
  • contact_maps/
    • ${specimen}/
      • ${run}/
        • ${assembly}.hic.${specimen}.${run}.(cool|mcool|pretext|pretext.png)

When alignments are produced across multiple runs, merged.${#} is used as a pseudo run identifier, where ${#} is automatically incremented from 1. The SOURCE file then lists the runs that were included in the merged analysis.

Example:

read_mapping/hic/icLepMacu1/ERR9248445/GCA_936432065.2.hic.icLepMacu1.ERR9248445.minimap2.cram
read_mapping/hic/icLepMacu1/ERR9248445/qc/hic.icLepMacu1.ERR9248445.fastqc.html
read_mapping/hic/icLepMacu1/merged.1/GCA_936432065.2.hic.icLepMacu1.merged.1.minimap2.cram
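
The naming scheme above is regular enough to be generated from the identifiers. The following sketch assumes two hypothetical helpers: `read_mapping_cram` assembles a CRAM path, and `next_merged_id` picks the next `merged.${#}` name by taking the highest existing number plus one. The document only states that the number is auto-incremented from 1, so the exact allocation rule is an assumption.

```python
import re
from pathlib import PurePosixPath


def read_mapping_cram(assembly, type_, specimen, run, aligner):
    """Build the CRAM path for one run (or one merged.N pseudo run)."""
    directory = PurePosixPath("read_mapping", type_, specimen, run)
    filename = f"{assembly}.{type_}.{specimen}.{run}.{aligner}.cram"
    return str(directory / filename)


def next_merged_id(existing_dirs):
    """Return the next 'merged.N' name given the run directories already present."""
    used = [
        int(m.group(1))
        for name in existing_dirs
        if (m := re.fullmatch(r"merged\.(\d+)", name))
    ]
    return f"merged.{max(used, default=0) + 1}"  # assumption: highest number + 1


single = read_mapping_cram("GCA_936432065.2", "hic", "icLepMacu1", "ERR9248445", "minimap2")
pseudo_run = next_merged_id(["ERR9248445"])  # "merged.1"
merged = read_mapping_cram("GCA_936432065.2", "hic", "icLepMacu1", pseudo_run, "minimap2")
```

Both calls reproduce the first and third example paths shown above.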

Variant calling and analysis

The following outputs come from the variant calling and variant composition pipelines.

  • variant_analysis/
    • ${type}/
      • ${specimen}/
        • ${run}/
          • ${assembly}.${type}.${specimen}.${run}.${aligner}.${caller}.(vcf|g.vcf).gz – repeated for each caller used
          • composition/
            • ${assembly}.${type}.${specimen}.${run}.${aligner}.${caller}.(vcf|g.vcf).(frq|het|indel.hist|roh|sites.pi.gz|snpden)
          • qc/
            • ${assembly}.${type}.${specimen}.${run}.${aligner}.${caller}.(vcf|g.vcf).(plot-vcfstats.pdf|stats.visual_report.html)
          • stats/
            • ${assembly}.${type}.${specimen}.${run}.${aligner}.${caller}.(vcf|g.vcf).(stats.bcftools.txt.gz|plot-vcfstats.tar.gz)

The triplet (${type}, ${specimen}, ${run}) is expected to match files from the read_mapping/ folder, including merged.${#} forms of ${run}.

Note: In practice, we only envisage using PacBio data.

Example:

variant_analysis/pacbio/icLepMacu1/ERR9284044/GCA_936432065.2.pacbio.icLepMacu1.ERR9284044.minimap2.deepvariant.vcf.gz
variant_analysis/pacbio/icLepMacu1/ERR9284044/stats/GCA_936432065.2.pacbio.icLepMacu1.ERR9284044.minimap2.deepvariant.vcf.stats.bcftools.txt.gz
variant_analysis/pacbio/icLepMacu1/ERR9284044/composition/GCA_936432065.2.pacbio.icLepMacu1.ERR9284044.minimap2.deepvariant.vcf.sites.pi.gz
variant_analysis/pacbio/icLepMacu1/merged.1/GCA_936432065.2.pacbio.icLepMacu1.merged.1.minimap2.deepvariant.vcf.gz
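
Since every identifier is embedded in the file name, a variant file can be traced back to its read_mapping/ directory from the name alone. The sketch below is illustrative only. It assumes that accessions look like `GCA_` or `GCF_` followed by digits and a version, and that no other identifier contains a dot except the `merged.N` pseudo run; `parse_variant_name` and `matching_read_mapping_dir` are hypothetical names.

```python
import re

# Assumption: identifiers other than the assembly and merged.N contain no dots.
VARIANT_NAME = re.compile(
    r"(?P<assembly>GC[AF]_\d+\.\d+)\."
    r"(?P<type>[^.]+)\."
    r"(?P<specimen>[^.]+)\."
    r"(?P<run>merged\.\d+|[^.]+)\."
    r"(?P<aligner>[^.]+)\."
    r"(?P<caller>[^.]+)\."
    r"(?P<format>g\.vcf|vcf)\.gz"
)


def parse_variant_name(filename):
    """Split a variant file name into its identifiers."""
    match = VARIANT_NAME.fullmatch(filename)
    if match is None:
        raise ValueError(f"not a recognised variant file name: {filename}")
    return match.groupdict()


def matching_read_mapping_dir(fields):
    """Directory under read_mapping/ that shares the (type, specimen, run) triplet."""
    return f"read_mapping/{fields['type']}/{fields['specimen']}/{fields['run']}"


fields = parse_variant_name(
    "GCA_936432065.2.pacbio.icLepMacu1.merged.1.minimap2.deepvariant.vcf.gz"
)
mapping_dir = matching_read_mapping_dir(fields)
```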

BUSCO analysis

The following outputs come from the BlobToolKit and genome note pipelines.

  • busco/
    • ${lineage}/
      • ${assembly}.${lineage}.(full_table.tsv|missing_busco_list.tsv|(single_copy|multi_copy|fragmented)_busco_sequences.tar.gz|short_summary.(json|tsv|txt)|hmmer_output.tar.gz)
  • ancestral_plots/
    • ${lineage}/
      • ${ancestral_set}/
        • ${assembly}.${lineage}.${ancestral_set}.buscopainter.(pdf|png)

Example:

busco/insecta_odb12/GCA_936432065.2.insecta_odb12.full_table.tsv
ancestral_plots/lepidoptera_odb10/Merian/GCA_936432065.2.lepidoptera_odb10.Merian.buscopainter.pdf

BlobToolKit

The following outputs specifically come from the BlobToolKit pipeline.

  • blobtoolkit/
    • ${assembly}/
      • *.json.gz
    • plots/
      • ${assembly}.*.png

Example:

blobtoolkit/GCA_936432065.2/
blobtoolkit/plots/GCA_936432065.2.snail.png

Genome statistics and features

The following outputs are computed by the sequence composition and genome note pipelines, or downloaded by the Ensembl gene download and Ensembl repeat download pipelines.

  • base_content/
    • k1/
      • ${assembly}.(mononuc.1k.tsv.gz|(A|C|G|T|N|(AT|GC)_skew|GC).1k.bedGraph.gz)
    • k2/
      • ${assembly}.(dinuc.1k.tsv.gz|(CpG|dinucShannon).1k.bedGraph.gz)
    • k3/
      • ${assembly}.(trinuc.1k.tsv.gz|trinucShannon.1k.bedGraph.gz)
    • k4/
      • ${assembly}.(tetranuc.1k.tsv.gz|tetranucShannon.1k.bedGraph.gz)
  • genes/
    • ${source}/
      • ${assembly}.${source}.(gff3.gz|(cdna|cds|pep).fa.gz)
  • genome_stats/
    • ${assembly}.gfastats.txt
    • ${type}/
      • ${specimen}/
        • ${run}/
          • merqury/
            • ${assembly}.${type}.${specimen}.${run}.(completeness.stats|qv|spectra-asm.*.png)
            • ${assembly}.${type}.${specimen}.${run}.${target_assembly}.(only.bed.gz|qv|spectra-cn.*.png)
          • genomescope/
            • ${assembly}.${type}.${specimen}.${run}.genomescope_((transformed_)?(linear|log)_plot.png|(model|summary).txt)
  • repeats/
    • ${source}/
      • ${assembly}.${source}.(bed.gz|masked.fa.gz)

The triplet (${type}, ${specimen}, ${run}) is expected to match files from the read_mapping/ folder, including merged.${#} forms of ${run}.

Note: In practice, we only envisage using PacBio data.

Note: The list will grow significantly once full development of the sequence composition pipeline starts.

Example:

base_content/k1/GCA_936432065.2.GC.1k.bedGraph.gz
genes/ensembl.2023_05/GCA_936432065.2.ensembl.2023_05.gff3.gz
genome_stats/pacbio/icLepMacu1/ERR9284044/merqury/GCA_936432065.2.pacbio.icLepMacu1.ERR9284044.completeness.stats
genome_stats/pacbio/icLepMacu1/merged.1/merqury/GCA_936432065.2.pacbio.icLepMacu1.merged.1.GCA_936443135.2.qv
genome_stats/pacbio/icLepMacu1/merged.1/genomescope/GCA_936432065.2.pacbio.icLepMacu1.merged.1.genomescope_log_plot.png
repeats/ensembl/GCA_936432065.2.ensembl.bed.gz

Genome note

The following outputs specifically come from the genome note pipeline, as it fills in a genome note template.

  • gene/
    • ${source}/
      • ${assembly}.${source}.stats.csv
  • genome_note/
    • ${assembly}.(csv|docx|md|xml|genome_note_(consistent|inconsistent).csv)