Detecting and quantifying overparametrization in RNA language models with REDIAL

This paper is a preprint and has not been certified by peer review.


Authors

Teng, D.; Qiu, Y.; Sakthivel, G.; Aranganathan, A.; Herron, L.; Tiwary, P.

Abstract

While RNA language models (LMs) have served as foundation models (FMs) to advance structural prediction, their evaluation relies heavily on supervised downstream tasks. Such tasks can mask FM inefficiencies and reflect memorization of downstream training sets. To address this, we introduce REDIAL (RNA Embedding perturbation Diagnostics for Language models), a zero-shot, unsupervised framework designed to extract coevolutionary signals directly from the high-dimensional latent spaces of RNA LMs. Through a layer-wise dissection and ablation study with REDIAL, we uncover stark disparities in how popular RNA LMs internalize structural constraints. Our results show that this layer-wise behavior deviates from that of protein LMs and traces back to design flaws in the architectures. Specifically, we show that current RNA LMs are severely overparameterized relative to the limited sequence diversity of available RNA databases, leading to profound parameter inefficiency and overfitting. Furthermore, we establish that structure-guided pre-training fundamentally improves the signal-to-noise ratio of learned coevolutionary couplings compared to sequence-only baselines. Ultimately, this unsupervised evaluation paradigm exposes critical flaws in current parameter-scaling strategies and provides a rigorous diagnostic benchmark to guide the development of more efficient, generalizable foundation models for RNA therapeutics and de novo design.
