MuseDrift: Navigating Protein Evolutionary Manifolds with Conditional Discrete Diffusion
Wang, C.; Wang, Y.
Abstract
Protein engineering often requires generating variants of a wild-type (WT) sequence while controlling how far they drift in sequence space. Existing generative models support de novo design but offer limited control over WT similarity. We introduce MuseDrift, a conditional discrete diffusion model for WT-anchored, distance-controlled protein generation. Trained on a 38.2M-pair Seed-and-Stratify corpus, MuseDrift combines WT-prefix conditioning with random-order iterative unmasking to enable stable multi-residue generation. Its key feature is a calibrated identity dial: after lightweight calibration, generated sequences match a target WT identity τ within approximately ±0.05 over τ ∈ [0.55, 0.95] on held-out probes. On Mol-Instructions and CAMEO under shared evaluation oracles, MuseDrift is competitive with multimodal and text-conditioned baselines while uniquely providing explicit identity control. At τ = 0.95, it achieves pLDDT scores of 84.97 on Mol-Instructions and 83.14 on CAMEO with only 85M parameters, rivaling much larger 1.8B-2B models. Evolutionary and FoldX analyses further support biological plausibility and structural stability.
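To make the identity dial concrete, the following minimal sketch shows one way to check whether a generated variant lands within the stated tolerance of a target WT identity. It is an illustration only, not the authors' evaluation code: the function names `wt_identity` and `within_dial`, the example sequences, and the assumption that identity is the position-wise match fraction over equal-length sequences are all hypothetical. The paper may compute identity over an alignment instead.

```python
def wt_identity(wt: str, variant: str) -> float:
    """Fraction of positions where the variant matches the wild type.

    Assumption: both sequences have the same length and are compared
    position by position (no alignment, no gaps).
    """
    if len(wt) != len(variant):
        raise ValueError("this sketch assumes equal-length sequences")
    if not wt:
        raise ValueError("sequences must be non-empty")
    matches = sum(a == b for a, b in zip(wt, variant))
    return matches / len(wt)


def within_dial(wt: str, variant: str, tau: float, tol: float = 0.05) -> bool:
    """True if the variant's WT identity is within `tol` of the target tau."""
    return abs(wt_identity(wt, variant) - tau) <= tol


# Hypothetical 20-residue example with two substitutions (T3S, R10K).
wt = "MKTAYIAKQRQISFVKSHFS"
variant = "MKSAYIAKQKQISFVKSHFS"

print(wt_identity(wt, variant))        # 0.9
print(within_dial(wt, variant, 0.90))  # True
print(within_dial(wt, variant, 0.70))  # False
```

Under these assumptions, a variant requested at τ = 0.90 counts as on target when its measured identity falls in [0.85, 0.95].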