ClaroAI-Bench: Evaluating Agentic Scientific Reproducibility on Real Biomedical Papers
O'Connell, K. A.
Abstract

We introduce ClaroAI-Bench, an evaluation suite for measuring AI agents' ability to reproduce computational findings from published biomedical research. The benchmark comprises 35 real NIH-funded papers spanning five modalities (genomics, imaging, clinical/EHR, epidemiology, wet-lab) scored on a five-dimension rubric: data findability (D1), data accessibility (D2), code availability (D3), environment reconstructability (D4), and results reproducibility (D5). Each task requires an agent to locate data, obtain code, reconstruct the compute environment, execute the analysis, and verify results against published claims, mirroring the full scientific reproduction pipeline. In a three-condition ablation, an audit-only baseline (D1-D4 metadata scoring) and a bash-only agent (API + bash tool) both achieve 0% D5 reproduction, while a full-capability agent (Claude Code, all tools) reproduces 20 of 33 computational papers (60.6%; 95% CI [42.4, 75.8]). D1-D4 metadata scores strongly predict D5 outcomes (Spearman r=0.68, p<0.0001), and papers with accessible data and code achieve 2.9 times higher D5 scores than restricted papers (p=0.0013). Multi-model scoring with three frontier models (Claude Opus 4.6, GPT-5.4, Gemini 2.5 Pro) yields inter-model agreement of r=0.85-0.97 on D3 but only r=0.51-0.81 on D4, identifying environment reconstruction as the dimension with the highest evaluator disagreement. ClaroAI-Bench fills a gap between code-generation benchmarks (SWE-bench) and end-to-end scientific AI evaluations by testing long-horizon, real-world reproduction tasks with brittle environments, missing metadata, and access constraints. The benchmark, scoring rubric, agent logs, and pip-installable auditor are archived at https://doi.org/10.5281/zenodo.20071236 and on HuggingFace Datasets at https://huggingface.co/datasets/kyleaoconnell22/claroai-bench.