Description

Collapse small ORFs that share an amino-acid sequence cluster into a single catalogue entry. Run this module downstream of custom/orfmerge (coordinate-based catalogue), bedtools/getfasta + seqkit/translate (AA FASTA keyed by orf_id), and mmseqs/easycluster (AA clusters).

The coordinate-based merge in custom/orfmerge only groups ORFs that overlap on the genome, so the same micropeptide encoded at several distinct, non-overlapping loci (typically in repetitive regions) survives as separate rows. This module adopts the peptide-level deduplication and the 0.9 amino-acid-similarity threshold of the GENCODE Ribo-seq ORF consolidation (Mudge et al. 2022, Nat Biotechnol, doi:10.1038/s41587-022-01369-0; gencode-riboseqORFs collapse_cutoff 0.9), but implements it with MMseqs2 sequence-identity clustering rather than that tool's longest-shared-string / P-site-overlap metric. Small ORFs (orf_class "smORF", i.e. aa_length <= 100) are clustered by amino-acid identity upstream, and this module folds each multi-member cluster down to one representative.
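As an illustration of the cluster input described above, the following minimal sketch reads a two-column MMseqs2 cluster TSV (representative, then member, tab-separated) and keeps only the multi-member clusters that would need collapsing. The function names and the example identifiers are hypothetical and are not taken from the module's own code.

```python
import csv
import io
from collections import defaultdict


def read_clusters(handle):
    """Map each MMseqs2 cluster label to the list of member orf_ids.

    Assumes the two-column TSV layout: representative <TAB> member.
    """
    clusters = defaultdict(list)
    for row in csv.reader(handle, delimiter="\t"):
        if len(row) < 2:
            continue  # skip blank or malformed lines
        rep, member = row[0], row[1]
        clusters[rep].append(member)
    return dict(clusters)


def multi_member(clusters):
    """Keep only clusters with more than one member."""
    return {rep: members for rep, members in clusters.items() if len(members) > 1}


# Hypothetical example: orfA and orfB share a cluster, orfC is a singleton.
example = "orfA\torfA\norfA\torfB\norfC\torfC\n"
clusters = read_clusters(io.StringIO(example))
to_collapse = multi_member(clusters)
print(to_collapse)
```

Singleton clusters need no further work, so only the entries in `to_collapse` would be passed on to the collapsing step.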

Only smORF rows are collapsed; larger ORFs and transcript-anchored classes are passed through untouched. Among the smORF members of a cluster, the representative is the member with the longest aa_length (ties broken by orf_id), so the result does not depend on which sequence MMseqs2 labelled as the cluster representative. Catalogue row order is preserved; dropped members fold their called_by_<caller> / score_<caller> evidence, n_samples / samples recurrence and gene mappings into the survivor.
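The selection and folding rules above can be sketched roughly as follows. This is not the module's implementation. Several details are assumptions made for illustration: ties go to the lexicographically smallest orf_id, merged scores take the maximum, samples and gene mappings are comma-separated strings merged by set union, the gene column is called `gene_id`, and the survivor keeps its own position in the catalogue.

```python
def pick_representative(members):
    """Longest aa_length wins; ties go to the smallest orf_id (assumed direction)."""
    return min(members, key=lambda r: (-r["aa_length"], r["orf_id"]))


def union_field(members, key):
    """Union of comma-separated values across members (assumed storage format)."""
    values = {v for m in members for v in (m.get(key) or "").split(",") if v}
    return ",".join(sorted(values))


def collapse(rows, cluster_of):
    """rows: catalogue rows (dicts) in order; cluster_of: orf_id -> cluster label."""
    groups = {}
    for row in rows:
        if row["orf_class"] == "smORF" and row["orf_id"] in cluster_of:
            groups.setdefault(cluster_of[row["orf_id"]], []).append(row)

    out = []
    for row in rows:
        members = None
        if row["orf_class"] == "smORF":
            members = groups.get(cluster_of.get(row["orf_id"]))
        if not members or len(members) == 1:
            out.append(dict(row))  # non-smORF or singleton: pass through
            continue
        rep = pick_representative(members)
        if row is not rep:
            continue  # dropped member; its evidence is folded into the survivor
        merged = dict(rep)
        for key in rep:
            if key.startswith("called_by_"):
                merged[key] = any(m.get(key) for m in members)
            elif key.startswith("score_"):
                scores = [m[key] for m in members if m.get(key) is not None]
                merged[key] = max(scores) if scores else None  # assumption: max
        merged["samples"] = union_field(members, "samples")
        merged["n_samples"] = len(merged["samples"].split(",")) if merged["samples"] else 0
        merged["gene_id"] = union_field(members, "gene_id")  # hypothetical column
        out.append(merged)
    return out


# Hypothetical catalogue: callers "a" and "b"; "uORF" stands in for a non-smORF class.
rows = [
    {"orf_id": "orf1", "orf_class": "smORF", "aa_length": 30, "called_by_a": True,
     "called_by_b": False, "score_a": 0.8, "score_b": None,
     "samples": "s1", "n_samples": 1, "gene_id": "G1"},
    {"orf_id": "orf2", "orf_class": "smORF", "aa_length": 32, "called_by_a": False,
     "called_by_b": True, "score_a": None, "score_b": 0.6,
     "samples": "s2,s1", "n_samples": 2, "gene_id": "G2"},
    {"orf_id": "orf3", "orf_class": "uORF", "aa_length": 40, "called_by_a": True,
     "called_by_b": False, "score_a": 0.5, "score_b": None,
     "samples": "s3", "n_samples": 1, "gene_id": "G3"},
]
cluster_of = {"orf1": "orf1", "orf2": "orf1", "orf3": "orf1"}
collapsed = collapse(rows, cluster_of)
print([r["orf_id"] for r in collapsed])
```

In this example orf2 survives because it is the longer of the two smORF members, even though the cluster label is orf1, and orf3 is passed through because it is not a smORF.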

Input

Name    Description    Pattern

Output

Name    Description    Pattern

Tools

orfcollapse Documentation

Python helper that folds small-ORF catalogue rows sharing an MMseqs2 amino-acid cluster into a single representative, merging cross-caller provenance, cross-sample recurrence and gene mappings.