When AI Co-Scientists Fail: SPOT-a Benchmark for Automated Verification of Scientific Research

May 17, 2025
Authors: Guijin Son, Jiwoo Hong, Honglu Fan, Heejeong Nam, Hyunwoo Ko, Seungwon Lim, Jinyeop Song, Jinha Choi, Gonçalo Paulo, Youngjae Yu, Stella Biderman
cs.AI

Abstract

Recent advances in large language models (LLMs) have fueled the vision of automated scientific discovery, often called AI Co-Scientists. To date, prior work casts these systems as generative co-authors responsible for crafting hypotheses, synthesizing code, or drafting manuscripts. In this work, we explore a complementary application: using LLMs as verifiers to automate the academic verification of scientific manuscripts. To that end, we introduce SPOT, a dataset of 83 published papers paired with 91 errors significant enough to prompt errata or retraction, cross-validated with actual authors and human annotators. Evaluating state-of-the-art LLMs on SPOT, we find that none surpasses 21.1% recall or 6.1% precision (o3 achieves the best scores, with all others near zero). Furthermore, confidence estimates are uniformly low, and across eight independent runs, models rarely rediscover the same errors, undermining their reliability. Finally, qualitative analysis with domain experts reveals that even the strongest models make mistakes resembling student-level misconceptions derived from misunderstandings. These findings highlight the substantial gap between current LLM capabilities and the requirements for dependable AI-assisted academic verification.
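
The recall and precision figures above compare model-flagged errors against the annotated ground truth, and the consistency claim concerns how often the same error is rediscovered across the eight runs. Below is a minimal sketch of that bookkeeping, assuming errors can be matched by hypothetical (paper_id, location) identifiers; SPOT's actual matching protocol relies on expert judgment rather than exact identifiers, so this is an illustration, not the benchmark's evaluation code.

```python
# Minimal sketch of recall/precision and cross-run consistency scoring.
# Assumption: each error is identified by a (paper_id, location) pair and
# a prediction counts as correct only on an exact match. SPOT's real
# protocol uses expert-judged semantic matching instead.

def score_run(gold: set[tuple[str, str]],
              predicted: set[tuple[str, str]]) -> tuple[float, float]:
    """Return (recall, precision) for one evaluation run."""
    hits = len(gold & predicted)
    recall = hits / len(gold) if gold else 0.0
    precision = hits / len(predicted) if predicted else 0.0
    return recall, precision

def rediscovery_rate(gold: set[tuple[str, str]],
                     runs: list[set[tuple[str, str]]]) -> float:
    """Fraction of gold errors recovered in every independent run,
    i.e., how consistently a model re-finds the same error."""
    found_in_all = set(gold)
    for run in runs:
        found_in_all &= run
    return len(found_in_all) / len(gold) if gold else 0.0

# Toy example with hypothetical identifiers (not real SPOT data):
gold = {("paper_07", "eq_3"), ("paper_07", "table_2"), ("paper_19", "fig_4")}
runs = [
    {("paper_07", "eq_3"), ("paper_19", "sec_5")},  # run 1
    {("paper_07", "eq_3")},                         # run 2
]

r, p = score_run(gold, runs[0])
print(f"recall={r:.3f} precision={p:.3f}")                # recall=0.333 precision=0.500
print(f"rediscovery={rediscovery_rate(gold, runs):.3f}")  # 0.333
```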