Description
Collapse small ORFs that share an amino-acid sequence cluster into a single
catalogue entry. Run this module downstream of custom/orfmerge (coordinate-based
catalogue), bedtools/getfasta + seqkit/translate (AA FASTA keyed by orf_id), and
mmseqs/easycluster (AA clusters).
The coordinate-based merge in custom/orfmerge only groups ORFs that overlap
on the genome, so the same micropeptide encoded at several distinct,
non-overlapping loci (typically repetitive regions) survives as separate rows.
This module adopts the peptide-level deduplication and 0.9 amino-acid-similarity
threshold of the GENCODE Ribo-seq ORF consolidation (Mudge et al. 2022,
Nat Biotechnol, doi:10.1038/s41587-022-01369-0; gencode-riboseqORFs
collapse_cutoff 0.9), implemented here with MMseqs2 sequence-identity
clustering rather than that tool's longest-shared-string / P-site-overlap
metric. Small ORFs (orf_class "smORF", i.e. aa_length <= 100) are clustered by
amino-acid identity upstream and this module folds each multi-member cluster
down to one representative.
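The cluster input described above can be sketched as follows. This is a minimal illustration, not the module's own code: it assumes the clusters arrive as the two-column, tab-separated file that MMseqs2 easy-cluster writes (representative ID, then member ID, one member per line), and the function name multi_member_clusters is hypothetical.

```python
import csv
import io
from collections import defaultdict


def multi_member_clusters(tsv_handle):
    """Group member orf_ids by cluster and keep clusters with 2+ members.

    Assumes a two-column TSV: cluster representative, then member.
    """
    clusters = defaultdict(list)
    for row in csv.reader(tsv_handle, delimiter="\t"):
        if len(row) < 2:
            continue  # skip blank or malformed lines
        representative, member = row[0], row[1]
        clusters[representative].append(member)
    # Singleton clusters need no collapsing, so drop them here.
    return {rep: ids for rep, ids in clusters.items() if len(ids) > 1}


# Toy input: orf_1 and orf_7 share a cluster, orf_3 is a singleton.
example = io.StringIO("orf_1\torf_1\norf_1\torf_7\norf_3\torf_3\n")
groups = multi_member_clusters(example)
```

Only the multi-member clusters matter downstream, since a singleton cluster leaves its row unchanged.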
Only smORF rows are collapsed; larger ORFs and transcript-anchored classes are
passed through untouched. Among the smORF members of a cluster the
representative is chosen by longest aa_length (ties broken by orf_id), so the
result does not depend on which sequence MMseqs2 labelled the cluster
representative. Catalogue row order is preserved; dropped members fold their
called_by_<caller> / score_<caller> evidence, n_samples / samples
recurrence and gene mappings into the survivor.
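The selection and folding rules above can be sketched in a few lines. This is a hedged illustration under stated assumptions, not the orfcollapse implementation: rows are modelled as dicts, samples and a hypothetical gene_ids column are held as lists, called_by_<caller> flags are merged with logical OR, score_<caller> values with max, and orf_id ties are broken by the lexicographically smallest ID. The function name collapse_smorfs, the "caller x" columns and the non-smORF class label in the example are likewise invented for the example.

```python
def collapse_smorfs(rows, cluster_of):
    """rows: catalogue rows (dicts) in order; cluster_of: orf_id -> cluster label."""
    # Only smORF rows that belong to a cluster are candidates for collapsing.
    groups = {}
    for row in rows:
        cid = cluster_of.get(row["orf_id"])
        if row["orf_class"] == "smORF" and cid is not None:
            groups.setdefault(cid, []).append(row)

    # Representative: longest aa_length, ties broken by orf_id (assumed ascending).
    survivor_of = {}
    for members in groups.values():
        if len(members) < 2:
            continue
        rep = min(members, key=lambda r: (-r["aa_length"], r["orf_id"]))
        for member in members:
            survivor_of[member["orf_id"]] = rep["orf_id"]

    by_id = {r["orf_id"]: dict(r) for r in rows}
    for row in rows:
        sid = survivor_of.get(row["orf_id"])
        if sid is None or sid == row["orf_id"]:
            continue
        keep = by_id[sid]
        for key, value in row.items():
            if key.startswith("called_by_"):
                keep[key] = bool(keep.get(key)) or bool(value)  # assumed: OR
            elif key.startswith("score_"):
                scores = [v for v in (keep.get(key), value) if v is not None]
                keep[key] = max(scores) if scores else None  # assumed: max
        keep["samples"] = sorted(set(keep["samples"]) | set(row["samples"]))
        keep["n_samples"] = len(keep["samples"])
        keep["gene_ids"] = sorted(set(keep["gene_ids"]) | set(row["gene_ids"]))

    # Original row order is preserved; dropped members are simply skipped.
    return [by_id[r["orf_id"]] for r in rows
            if survivor_of.get(r["orf_id"], r["orf_id"]) == r["orf_id"]]


rows = [
    {"orf_id": "orf_1", "orf_class": "smORF", "aa_length": 40,
     "called_by_x": True, "score_x": 0.5,
     "n_samples": 1, "samples": ["s1"], "gene_ids": ["G1"]},
    {"orf_id": "orf_2", "orf_class": "other", "aa_length": 150,
     "called_by_x": True, "score_x": 0.9,
     "n_samples": 1, "samples": ["s1"], "gene_ids": ["G3"]},
    {"orf_id": "orf_7", "orf_class": "smORF", "aa_length": 55,
     "called_by_x": False, "score_x": None,
     "n_samples": 1, "samples": ["s2"], "gene_ids": ["G2"]},
]
result = collapse_smorfs(rows, {"orf_1": "c1", "orf_7": "c1"})
```

In this toy input orf_7 survives because it is the longer of the two clustered smORFs, and it inherits the caller flag, score, sample and gene mapping of orf_1, while the non-smORF row passes through unchanged. Because the representative is recomputed from aa_length and orf_id, the cluster label itself carries no weight.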
Tools
orfcollapse
Python helper that folds small-ORF catalogue rows sharing an MMseqs2 amino-acid cluster into a single representative, merging cross-caller provenance, cross-sample recurrence and gene mappings.