Protein Electrostatic Properties are Fine-Tuned Through Evolution
Shen, M.; Dayhoff, G. W.; Shen, J.
Abstract

Protein ionization states provide the electrostatic forces that modulate protein structure, stability, solubility, and function. Until now, predicting ionization states and understanding protein electrostatics have relied on structural information. Here we demonstrate that primary sequence alone enables remarkably accurate pKa predictions through KaML-ESM, a model that leverages evolutionary representations from ultra-large ESM protein language models and pretraining on a synthetic pKa dataset. KaML-ESM achieves RMSEs approaching the experimental precision limit of 0.5 pH units for Asp, Glu, His, and Lys residues, and reduces the Cys prediction error to 1.1 units, with further improvement expected as the training dataset expands. The state-of-the-art performance of KaML-ESM was further validated through external evaluations, including a proteome-wide analysis of protein pKa values. Our results support the notion that protein sequence encodes not only structure and function but also electrostatic properties, which may have been co-optimized through evolution. Lastly, we provide KaML, a sequence-based end-to-end ML platform that enables researchers to map protein electrostatic landscapes, facilitating applications ranging from drug design and protein engineering to molecular simulations.