Protein Function Prediction with Pretrained ProtT5 Embeddings and Gradient Boosting
Appel, J.; Butcher, N.
Abstract

Protein function prediction remains a central challenge in computational biology due to the extreme sparsity and long-tail distribution of Gene Ontology (GO) [1] annotations. Advances in protein language models enable the extraction of dense, fixed-length representations from amino acid sequences, offering a scalable alternative to hand-crafted features such as physicochemical properties. In this work, we evaluate a transformer-based embedding approach using ProtT5-XL combined with classical and modern multi-label classifiers for Gene Ontology prediction in the CAFA-6 setting. Fixed-length embeddings were generated via mean pooling of transformer hidden states and used as input to one-vs-rest logistic regression, gradient-boosted decision trees, and a neural network. Models were evaluated on held-out validation data with a focus on threshold selection, prediction sparsity, and behavior across frequent and rare GO terms. Gradient boosting consistently provided the best balance between predictive performance and stable prediction behavior, motivating its use for ontology-specific predictors across molecular function, biological process, and cellular component annotations. This study highlights practical modeling choices for large-scale protein function prediction using pretrained sequence embeddings and provides an interpretable baseline for future CAFA evaluations.
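The fixed-length embedding step described in the abstract can be sketched as masked mean pooling over per-residue hidden states. This is an illustrative sketch, not the authors' code: the sequence length, mask handling, and the use of NumPy arrays in place of actual ProtT5-XL outputs are all assumptions (ProtT5-XL's hidden size of 1024 is the one concrete value carried over).

```python
import numpy as np

def mean_pool(hidden_states: np.ndarray, mask: np.ndarray) -> np.ndarray:
    """Masked mean pooling over residue positions: (L, d) -> (d,).

    `hidden_states` stands in for per-residue transformer outputs;
    `mask` flags real residues (True) vs. padding (False).
    """
    m = mask.astype(hidden_states.dtype)[:, None]  # (L, 1)
    return (hidden_states * m).sum(axis=0) / m.sum()

# Simulated per-residue embeddings for a protein of length 120
# (1024 matches ProtT5-XL's hidden size; the values are random).
rng = np.random.default_rng(0)
h = rng.standard_normal((120, 1024))
valid = np.ones(120, dtype=bool)

embedding = mean_pool(h, valid)
assert embedding.shape == (1024,)  # one fixed-length vector per protein
```

The resulting per-protein vectors are what the downstream one-vs-rest and gradient-boosted classifiers consume; padding masking matters in practice because batched transformer outputs include positions that should not contribute to the mean.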