Quantifying data reuse in proteomics using PRIDE downloads statistics and a semi-supervised LLM-based framework

This paper is a preprint and has not been certified by peer review.

Authors

Hewapathirana, S.; Bai, J.; Bandla, C.; Kamatchinathan, S.; Kundu, D. J.; John, N. S.; Brown-Harry, B.; Madhusoodanan, N.; Riera Duocastella, J. M.; Vizcaino, J. A.; Perez-Riverol, Y.

Abstract

Understanding how scientific datasets are accessed and reused is essential for resource planning and impact assessment. Here we present the PRIDE Archive download tracking infrastructure and a comprehensive analysis of 159.3 million download records from the PRIDE proteomics database (2021-2025), spanning 35,528 datasets accessed from 235 locations. The infrastructure includes nf-downloadstats, a scalable Nextflow pipeline for processing download logs, and DeepLogBot, a machine-learning framework that classifies traffic into bots, institutional download hubs, and independent user downloads. DeepLogBot combines heuristic seed selection with multi-LLM annotation (Claude and Qwen3) to produce gold-standard training labels, achieving 92.2% bot classification accuracy on a held-out test set. After separating bot traffic, analysis reveals downloads from 214 countries/regions, 249 institutional download hubs, and a concentrated reuse distribution, with the top five countries (United States, United Kingdom, Germany, China, and Canada) accounting for over 54% of independent user downloads. These findings provide actionable insights for repository infrastructure planning and highlight the importance of distinguishing automated from individual access in scientific data resources.
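
The concentration figure above (the share of independent user downloads coming from the top five countries) is a simple ratio once each download record carries a country and a traffic category. The following is a minimal sketch of that arithmetic, not the authors' pipeline: the record layout, the category names (`"user"`, `"bot"`, `"hub"`), the function name `top_k_share`, and the toy counts are all hypothetical and chosen only for illustration.

```python
from collections import Counter


def top_k_share(records, k=5, category="user"):
    """Fraction of `category` downloads that come from the k most active countries.

    `records` is an iterable of (country, category) pairs. This layout is an
    assumption made for this sketch, not the format used by nf-downloadstats.
    """
    # Count downloads per country, keeping only the requested traffic category.
    counts = Counter(country for country, cat in records if cat == category)
    total = sum(counts.values())
    if total == 0:
        return 0.0
    # Sum the k largest per-country counts and divide by the category total.
    top = sum(n for _, n in counts.most_common(k))
    return top / total


# Hypothetical toy records: bot and hub traffic is present but excluded
# from the share because only "user" records are counted.
records = (
    [("US", "user")] * 4
    + [("GB", "user")] * 3
    + [("DE", "user")] * 2
    + [("IN", "user")] * 1
    + [("US", "bot")] * 50
    + [("CN", "hub")] * 20
)

share = top_k_share(records, k=2)
print(f"Top-2 share of user downloads: {share:.0%}")
```

Filtering by category before counting is what makes the ratio sensitive to the bot and hub classification: in the toy data, counting all records instead would let automated traffic dominate the per-country totals.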
