Set-up, validation, evaluation, and cost-benefit analysis of an AI-assisted assessment of responsible research practices in a sample of life science publications

Authors

Kniffert, S.; Kathoefer, B.; Emprechtinger, R.; Pellegrini, P.; Funk, E. M.; Dhamrait, I. S.; Zang, Y.; Bornmueller, A.; Toelch, U.

Abstract

The (semi-)automated screening of publications for diverse quality and transparency criteria is at the core of systematic literature assessment. Typically, the assessment process involves two initial reviewers and one additional reviewer for cases that require reconciliation. Here, we explore to what extent this process can be assisted by Large Language Models (LLMs); specifically, whether LLMs can robustly assess responsible research practices (RRPs) in scientific papers. We employed proprietary LLMs to assess an initial set of 37 papers across ten RRPs. The same papers were also reviewed by three human reviewers. We iteratively redesigned prompts to increase model accuracy relative to the human ratings, which we treated as the gold standard. The resulting pipeline was validated on an additional set of 15 papers. We show that LLM accuracy is comparable to single human reviewer performance (90% for the LLM vs. 86% for a single human reviewer). However, performance depended strongly on the specific RRP, with accuracy ranging from 40% to 100%. LLMs exhibited an affirmative bias, making more errors when practices were not reported in the papers. Overall, we show how such an approach could potentially replace one human reviewer, enabling AI-assisted assessment of research papers. We discuss how dataset imbalances, validation procedures, and implementation time limit the broad applicability of such approaches. Through this, we develop initial guidance on the utility of proprietary LLMs in evidence synthesis.
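To make the evaluation step concrete, the sketch below shows one way to score LLM verdicts against reconciled human ratings treated as the gold standard: overall accuracy, per-RRP accuracy, and the error split by gold label that would reveal an affirmative bias. This is a minimal illustration under assumed data structures; all names (evaluate, llm_verdicts, gold_verdicts, the example RRP labels) are hypothetical and do not reflect the authors' actual pipeline or tooling.

```python
# Hypothetical sketch of the accuracy analysis described in the abstract.
# Both inputs map (paper_id, rrp) -> bool, where True means the practice
# was judged to be reported in the paper.
from collections import defaultdict

def evaluate(llm_verdicts, gold_verdicts):
    """Score LLM verdicts against gold-standard human ratings."""
    per_rrp_hits = defaultdict(int)
    per_rrp_total = defaultdict(int)
    errors_by_gold = defaultdict(int)   # errors split by gold label
    totals_by_gold = defaultdict(int)

    for key, gold in gold_verdicts.items():
        pred = llm_verdicts[key]
        _, rrp = key
        per_rrp_total[rrp] += 1
        totals_by_gold[gold] += 1
        if pred == gold:
            per_rrp_hits[rrp] += 1
        else:
            errors_by_gold[gold] += 1

    overall = sum(per_rrp_hits.values()) / sum(per_rrp_total.values())
    per_rrp = {r: per_rrp_hits[r] / per_rrp_total[r] for r in per_rrp_total}
    # An affirmative bias would appear as a higher error rate when the
    # gold label is False (practice not reported) than when it is True.
    error_rate_by_gold = {g: errors_by_gold[g] / totals_by_gold[g]
                          for g in totals_by_gold}
    return overall, per_rrp, error_rate_by_gold

# Illustrative usage with invented data:
gold = {("paper1", "data sharing"): True, ("paper1", "preregistration"): False}
llm = {("paper1", "data sharing"): True, ("paper1", "preregistration"): True}
overall, per_rrp, bias = evaluate(llm, gold)
```

Splitting error rates by gold label, rather than reporting accuracy alone, is what makes a class imbalance visible: a model that answers "reported" by default can look accurate overall while failing systematically on the rarer "not reported" cases.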
