Biological foundation models illuminate annotation blind spots in evolutionarily divergent genomes

This paper is a preprint and has not been certified by peer review.

Authors

Lanser, T. B.; Caldwell, S. K.; Pacheco, G. A.; Chen, J. W.; Saghaei, S.; Hassan, M.; Kronrod, M.; Wesemann, D. R.; Frost, H. R.

Abstract

Chromosome-scale assemblies are increasingly available for non-model organisms, but functional annotation remains limited when deep evolutionary divergence erodes primary amino-acid sequence identity even though protein structural similarity can remain conserved. We present a hybrid annotation framework that decouples gene-model discovery from cross-species similarity assignment by combining Evo2-based ab initio prediction of exon-intron structures with ESM-2 protein-embedding-based structural similarity mapping. Applied to the sea lamprey, the framework derives high- or medium-confidence cross-species similarity assignments for 73,485 Evo2-derived translated protein models, including 35,395 high-confidence calls, and expands the deduplicated structural catalog to 31,286 loci, including 20,871 additions absent from the Ensembl baseline. A joint alignment-structure classification identifies 21,391 structurally supported catalog loci that a fixed human DIAMOND protein search does not confidently assign on its own, including 21,184 loci with no detectable human protein-sequence match and 207 loci with only low-confidence matches in the classical 20-30% amino-acid-identity twilight zone. These rescue-space totals describe catalog loci rather than validated one-to-one human-absent genes. In a single-cell RNA sequencing application, a stricter UTR-aware Ensembl+Evo2 reference improves gene recovery and expands the interpretable feature space of the lamprey immune compartment relative to the Ensembl baseline. This enables more resolved annotation of four transcriptionally defined immune cell states, including VLRA+-associated T-like and VLRB+-associated B-like programs together with oxidative iron-handling and iron-associated VLR-linked states. 
Together, these results show that structural protein signal often persists beyond the limits of pairwise sequence alignment and that an embedding-based annotation layer can extend that signal to improve downstream comparative and single-cell analyses in evolutionarily divergent genomes.
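
The joint alignment-structure classification described above can be illustrated with a minimal sketch. This is not the authors' pipeline: the function names, the `(pct_identity, structurally_supported)` input layout, and the toy records are hypothetical, and the only values taken from the abstract are the 20-30% twilight-zone bounds and the reported rescue-space totals. The sketch assumes that each catalog locus carries the percent identity of its best human DIAMOND hit (or `None` when no hit is detected) and a boolean indicating embedding-based structural support.

```python
from collections import Counter
from typing import Iterable, Optional, Tuple

# Twilight-zone bounds (percent amino-acid identity) as stated in the abstract.
TWILIGHT_LOW, TWILIGHT_HIGH = 20.0, 30.0


def alignment_class(pct_identity: Optional[float]) -> str:
    """Label a locus by its best alignment hit (hypothetical scheme)."""
    if pct_identity is None:
        return "no_match"   # no detectable protein-sequence match
    if TWILIGHT_LOW <= pct_identity <= TWILIGHT_HIGH:
        return "twilight"   # low-confidence match in the twilight zone
    return "other_hit"      # any other hit; not counted as rescue space here


def rescue_counts(loci: Iterable[Tuple[Optional[float], bool]]) -> Counter:
    """Count structurally supported loci that alignment alone does not assign."""
    counts: Counter = Counter()
    for pct_identity, structurally_supported in loci:
        cls = alignment_class(pct_identity)
        if structurally_supported and cls in ("no_match", "twilight"):
            counts[cls] += 1
    return counts


# Toy records (made up for illustration): (percent identity or None, structural support).
toy_loci = [(None, True), (25.0, True), (62.5, True), (None, False)]
counts = rescue_counts(toy_loci)

# Consistency check on the totals reported in the abstract.
assert 21_184 + 207 == 21_391
```

On the toy input, only the first two records fall in the rescue space, one per class. The final assertion confirms that the two reported subtotals sum to the reported 21,391 structurally supported loci.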
