When AI Co-Scientists Fail: SPOT-a Benchmark for Automated Verification of Scientific Research
May 17, 2025
作者: Guijin Son, Jiwoo Hong, Honglu Fan, Heejeong Nam, Hyunwoo Ko, Seungwon Lim, Jinyeop Song, Jinha Choi, Gonçalo Paulo, Youngjae Yu, Stella Biderman
cs.AI
Abstract
Recent advances in large language models (LLMs) have fueled the vision of automated scientific discovery, often called AI Co-Scientists. To date, prior work has cast these systems as generative co-authors responsible for crafting hypotheses, synthesizing code, or drafting manuscripts. In this work, we explore a complementary application: using LLMs as verifiers to automate the academic verification of scientific manuscripts. To that end, we introduce SPOT, a dataset of 83 published papers paired with 91 errors significant enough to prompt errata or retraction, cross-validated with the actual authors and human annotators. Evaluating state-of-the-art LLMs on SPOT, we find that none surpasses 21.1% recall or 6.1% precision (o3 achieves the best scores, with all others near zero). Furthermore, confidence estimates are uniformly low, and across eight independent runs, models rarely rediscover the same errors, undermining their reliability. Finally, qualitative analysis with domain experts reveals that even the strongest models make mistakes resembling student-level misconceptions. These findings highlight the substantial gap between current LLM capabilities and the requirements for dependable AI-assisted academic verification.
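For concreteness, here is a minimal sketch of how the two kinds of numbers the abstract reports, per-run precision/recall against annotated errors and rediscovery consistency across the eight independent runs, might be computed. It is not the authors' evaluation code: it assumes flagged and ground-truth errors have already been matched to shared identifiers (the paper's matching and adjudication procedure is not described in the abstract), and every identifier in the example is invented.

```python
# Hypothetical sketch of SPOT-style scoring, NOT the authors' released code.
# Assumes ground-truth and model-flagged errors have already been matched to
# shared string identifiers; all identifiers below are invented for illustration.
from itertools import combinations


def precision_recall(gold: set[str], flagged: set[str]) -> tuple[float, float]:
    """Score one model run against the annotated errors of one paper."""
    if not flagged:
        return 0.0, 0.0
    hits = gold & flagged  # correctly flagged errors
    precision = len(hits) / len(flagged)
    recall = len(hits) / len(gold) if gold else 0.0
    return precision, recall


def rediscovery_consistency(runs: list[set[str]], gold: set[str]) -> float:
    """Mean pairwise Jaccard overlap of the *correct* hits across runs.

    One way to quantify the abstract's observation that, over eight
    independent runs, models rarely rediscover the same error twice.
    """
    hit_sets = [run & gold for run in runs]
    pairs = list(combinations(hit_sets, 2))

    def jaccard(a: set[str], b: set[str]) -> float:
        union = a | b
        # Two runs that find nothing share no rediscovered error.
        return len(a & b) / len(union) if union else 0.0

    return sum(jaccard(a, b) for a, b in pairs) / len(pairs) if pairs else 0.0


# Toy example: two annotated errors, eight runs, sporadic rediscovery.
gold = {"eq3_sign_flip", "table2_unit_mismatch"}
runs = [
    {"eq3_sign_flip"}, set(), {"fig1_mislabel"}, set(),
    {"eq3_sign_flip", "fig1_mislabel"}, set(), set(), {"table2_unit_mismatch"},
]
print(precision_recall(gold, runs[0]))      # (1.0, 0.5)
print(rediscovery_consistency(runs, gold))  # ~0.04: errors rarely rediscovered
```

Pairwise Jaccard over correct hits is just one plausible way to operationalize "rarely rediscover the same errors"; the paper may measure run-to-run consistency differently.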