Leveraging Large Language Models for Literature-Driven Prioritization of Protein Binding Pockets
Stratiichuk, R.; Melnychenko, M.; Koleiev, I.; Voitsitskyi, T.; Vladyslav, H.; Shevchuk, N.; Osrovsky, Z.; Bdzhola, V.; Yesylevskyy, S.; Starosyla, S.; Nafiiev, A.
Abstract
We present a novel approach for identifying and prioritizing protein binding pockets for small molecules by combining geometric pocket detection with Large Language Models (LLMs). Our method uses Fpocket to generate candidate pockets, which are then validated against published experimental data. These data are extracted from research articles by an LLM, using a series of prompts fine-tuned to identify and extract residue-level information associated with experimentally confirmed binding sites. We developed a curated benchmark dataset of diverse proteins and associated literature to train and evaluate the LLM's performance in paper relevance assessment and pocket extraction. The extracted information is then mapped onto protein structures and used to filter and merge the geometry-based predictions, generating a refined volumetric representation of biologically relevant pockets. This hybrid pipeline offers an efficient, accurate and automated method for identifying functional binding pockets, addressing a significant bottleneck in high-throughput drug discovery workflows. The developed benchmark dataset and methodology are freely available at https://github.com/MelnychenkoM/LLM-benchmark-dataset.
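The filter-and-merge step described in the abstract can be illustrated with a minimal Python sketch. This is not the authors' implementation: the `Pocket` class, the `filter_and_merge` function, the residue-set representation, and the `min_overlap` threshold are all assumptions made for illustration. The sketch treats each candidate pocket as the set of residues lining it, keeps the pockets that share enough residues with those reported in the literature, and merges the survivors by set union; the volumetric representation mentioned in the abstract is not modeled here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Pocket:
    """Hypothetical container for a candidate pocket (not an Fpocket type)."""
    pocket_id: int
    residues: frozenset  # (chain_id, residue_number) pairs lining the pocket


def filter_and_merge(pockets, literature_residues, min_overlap=0.2):
    """Keep pockets supported by literature residues, then merge them.

    A pocket is kept when at least `min_overlap` of its residues also
    appear in `literature_residues`. The threshold is an assumed value.
    Returns the kept pockets and the union of their residues.
    """
    literature_residues = frozenset(literature_residues)
    kept = []
    for pocket in pockets:
        if not pocket.residues:
            continue  # skip empty pockets to avoid division by zero
        shared = pocket.residues & literature_residues
        if len(shared) / len(pocket.residues) >= min_overlap:
            kept.append(pocket)
    # Union of residues over all kept pockets (empty if none were kept).
    merged = frozenset().union(*(p.residues for p in kept))
    return kept, merged


# Toy input: three candidate pockets and residues extracted from papers.
candidates = [
    Pocket(1, frozenset({("A", 10), ("A", 11), ("A", 12)})),
    Pocket(2, frozenset({("A", 50), ("A", 51)})),
    Pocket(3, frozenset({("A", 12), ("A", 13)})),
]
reported = {("A", 10), ("A", 12), ("A", 13)}

kept, merged = filter_and_merge(candidates, reported)
```

In this toy input, pockets 1 and 3 overlap the reported residues and are merged, while pocket 2 has no literature support and is dropped. A fraction-of-pocket criterion is only one possible choice; a distance-based or volume-based criterion would fit the same structure.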