OmniGene-4: A Unified Bio-Language MoE Model with Router-Level Interpretability
Wang, L.
Abstract
Mixture-of-Experts (MoE) architectures offer a rare opportunity to probe the internal organization of large language models, but this affordance has not been systematically exploited in biological foundation modeling. We introduce OmniGene-4, a unified bio-language foundation model built on Gemma-4-26B-A4B (30 layers, 128 experts per layer, top-8 routing) by injecting 28,028 biological tokens (DNA and protein BPE, Foldseek 3Di, DSSP secondary structure), continuing pretraining (CPT) on a 32.5 GB mixture of DNA, protein, natural-language and structural corpora, and supervised fine-tuning (SFT) on 199,576 instruction-format examples spanning eight task families. On a suite of standard benchmarks, the final model (v3) reaches 99.95% accuracy on BioPAWS standard protein homology (6,000 pairs), 59.50% on remote homology (2,000 pairs from protein_pair_remote), and 93.66% on BixBench knowledge questions. Relative to its un-fine-tuned vocabulary-extended Gemma-4-Instruct baseline (85% / 60% / 87%), v3 gains +14.5 on Standard, is comparable on Remote (-0.5, within statistical noise on this 2,000-pair sample), and gains +6.7 on BixBench. We do not claim parity with specialist remote-homology tools; published numbers for ESM-2, CATHe and PLMSearch on differently constructed splits reach 65--75%, and closing this gap is discussed as an open problem. By installing forward hooks on every router we directly measure how CPT and SFT each reshape expert routing. Across 400 prompts drawn from 8 modalities, the mean pair-wise Jensen--Shannon divergence between task routing distributions, averaged over the 30 layers, rises from 0.138 (vocabulary-extended baseline) to 0.230 after CPT and further to 0.232 after the full CPT+SFT pipeline. Under this layer-averaged metric, most of the increase (Delta JS +0.092) occurs during CPT, with the SFT stage contributing a small further rise (Delta JS +0.002).
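The layer-averaged metric described above can be illustrated with a minimal sketch. It assumes that each task's routing at a layer is summarized as a histogram of expert-selection counts, and that the divergence uses base-2 logarithms (so values lie in [0, 1]); neither detail is stated in this abstract, and the names `js_divergence`, `mean_pairwise_js` and the toy `routing` table are hypothetical.

```python
import math
from itertools import combinations


def normalize(counts):
    # Turn raw expert-selection counts into a probability distribution.
    # Assumes at least one count is non-zero.
    total = sum(counts)
    return [c / total for c in counts]


def js_divergence(p, q):
    # Jensen-Shannon divergence with base-2 logs (an assumption of this sketch).
    m = [(a + b) / 2 for a, b in zip(p, q)]

    def kl(x, y):
        # Terms with x_i == 0 contribute nothing; y_i > 0 whenever x_i > 0.
        return sum(a * math.log2(a / b) for a, b in zip(x, y) if a > 0)

    return 0.5 * kl(p, m) + 0.5 * kl(q, m)


def mean_pairwise_js(routing):
    # routing[task][layer] is a list of per-expert selection counts.
    # Returns (layer-averaged value, list of per-layer values).
    tasks = list(routing)
    n_layers = len(routing[tasks[0]])
    per_layer = []
    for layer in range(n_layers):
        dists = [normalize(routing[t][layer]) for t in tasks]
        pairs = list(combinations(dists, 2))
        per_layer.append(sum(js_divergence(p, q) for p, q in pairs) / len(pairs))
    return sum(per_layer) / n_layers, per_layer


# Toy data: 2 tasks, 2 layers, 2 experts (not the paper's measurements).
routing = {
    "dna": [[8, 0], [4, 4]],
    "protein": [[0, 8], [4, 4]],
}
average, per_layer = mean_pairwise_js(routing)
print(average, per_layer)
```

In this toy table the two tasks use disjoint experts at the first layer and identical experts at the second, so the per-layer values sit at the two extremes of the metric and the layer average falls halfway between them.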
The layer-wise picture is more nuanced: CPT reshapes routing in middle transformer layers (L_11--L_22, peak +0.16 at L_12), while SFT primarily reshapes the final two layers (L_28, L_29, peak +0.048 at L_29), so SFT is small under the aggregate metric but non-trivial at the layers nearest lm_head. We summarize this as a tentative representation/output-alignment factorization of bio-foundation training. At the token level, layer-12 routing reveals experts with strongly skewed token preferences, including an English-function-word expert at 80% NL purity, two DNA-dinucleotide experts, an amino-acid expert, and a cellular-biology expert; absolute purities for other experts are modest (15--46%), and we do not assume that "the same expert ID" refers to the same object across different layers. These findings are exploratory (a single architecture, a single training run, and a small-N routing sample), and we explicitly frame them as such throughout.
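One plausible reading of the token-level purity figures is sketched below. It assumes that an expert's purity is the share of tokens routed to it at a given layer that fall into its most frequent token category; the abstract does not define purity, so this definition, the function name `expert_purity`, and the toy assignments are all hypothetical.

```python
from collections import Counter, defaultdict


def expert_purity(assignments):
    # assignments: iterable of (expert_id, token_category) pairs for ONE layer.
    # Returns {expert_id: (dominant_category, purity)} under the assumed
    # definition: purity = dominant-category tokens / all tokens for that expert.
    per_expert = defaultdict(Counter)
    for expert_id, category in assignments:
        per_expert[expert_id][category] += 1

    result = {}
    for expert_id, counts in per_expert.items():
        category, top = counts.most_common(1)[0]
        result[expert_id] = (category, top / sum(counts.values()))
    return result


# Toy assignments (not the paper's data): expert 7 mostly sees NL tokens,
# expert 3 is split evenly between two categories.
assignments = (
    [(7, "NL")] * 4 + [(7, "DNA")]
    + [(3, "DNA")] * 2 + [(3, "protein")] * 2
)
purity = expert_purity(assignments)
print(purity[7])  # ('NL', 0.8)
```

Because the pairs are collected per layer, this sketch never compares expert IDs across layers, in line with the caveat that the same ID need not denote the same object at different depths.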