HARVEST: Unlocking the Dark Bioactivity Data of Pharmaceutical Patents via Agentic AI
Shepard, V.; Musin, A.; Chebykina, K.; Zeninskaya, N. A.; Mistryukova, L.; Avchaciov, K.; Fedichev, P. O.
Abstract

Pharmaceutical patents contain vast structure-activity relationship (SAR) tables documenting protein-ligand binding data that are technically public yet computationally inaccessible. These data are effectively dark: trapped in unstructured archives that no existing database has systematically captured. We present HARVEST, a multi-agent large language model pipeline that autonomously extracts structured bioactivity records from USPTO patent archives at $0.11 per document. Applied to 164,877 patents, HARVEST produced 3.36 million activity records and recovered 365,713 unique scaffolds and 1,108 protein targets absent from BindingDB, completing in under a week a task that would require over 55 years of continuous expert labor. Automated extraction achieves 91% agreement with human curators while exhibiting lower unit-conversion error rates. We further introduce H-Bench, a structurally guaranteed held-out benchmark built from this recovered data. Evaluation of the leading open-source model Boltz-2 on H-Bench reveals a two-dimensional generalization gap: performance degrades both on novel chemical scaffolds and on uncharacterized protein targets, exposing fundamental limitations of models trained on existing public repositories.
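The abstract singles out unit-conversion errors as a failure mode of bioactivity curation. As a rough illustration of what that normalization step involves, the sketch below converts a reported concentration to nanomolar and to a negative-log activity value (for example pIC50). It is a minimal sketch and not HARVEST's code: the function names, the unit table, and the choice of nM as the canonical unit are all assumptions made here for illustration.

```python
import math

# Multipliers that convert a concentration in the given unit to nanomolar.
# Hypothetical table for illustration only; not taken from HARVEST.
_TO_NM = {"pM": 1e-3, "nM": 1.0, "uM": 1e3, "mM": 1e6, "M": 1e9}


def to_nanomolar(value: float, unit: str) -> float:
    """Convert a molar concentration to nM, accepting 'µM'/'μM' spellings."""
    # Normalize the micro sign (U+00B5) and Greek mu (U+03BC) to ASCII 'u'.
    key = unit.strip().replace("\u00b5", "u").replace("\u03bc", "u")
    if key not in _TO_NM:
        raise ValueError(f"unsupported unit: {unit!r}")
    return value * _TO_NM[key]


def p_activity(value: float, unit: str) -> float:
    """Return -log10 of the concentration in mol/L (e.g. pIC50 or pKi)."""
    nm = to_nanomolar(value, unit)
    if nm <= 0:
        raise ValueError("concentration must be positive")
    # 1 M = 1e9 nM, so -log10(M) = 9 - log10(nM).
    return 9.0 - math.log10(nm)
```

A slip between µM and nM shifts the resulting value by three log units, which is why an explicit unit table that rejects unknown units is preferable to guessing a default.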