AI scientists produce results without reasoning scientifically

April 20, 2026
Authors: Martiño Ríos-García, Nawaf Alampara, Chandan Gupta, Indrajeet Mandal, Sajid Mannan, Ali Asghar Aghajani, N. M. Anoop Krishnan, Kevin Maik Jablonka
cs.AI

Abstract

Large language model (LLM)-based systems are increasingly deployed to conduct scientific research autonomously, yet whether their reasoning adheres to the epistemic norms that make scientific inquiry self-correcting is poorly understood. Here, we evaluate LLM-based scientific agents across eight domains, ranging from workflow execution to hypothesis-driven inquiry, through more than 25,000 agent runs and two complementary lenses: (i) a systematic performance analysis that decomposes the contributions of the base model and the agent scaffold, and (ii) a behavioral analysis of the epistemological structure of agent reasoning. We observe that the base model is the primary determinant of both performance and behavior, accounting for 41.4% of explained variance versus 1.5% for the scaffold. Across all configurations, evidence is ignored in 68% of traces, refutation-driven belief revision occurs in 26%, and convergent multi-test evidence is rare. The same reasoning patterns appear whether the agent executes a computational workflow or conducts hypothesis-driven inquiry, and they persist even when agents receive near-complete successful reasoning trajectories as context; the resulting unreliability compounds across repeated trials in epistemically demanding domains. Thus, current LLM-based agents execute scientific workflows but do not exhibit the epistemic patterns that characterize scientific reasoning. Outcome-based evaluation cannot detect these failures, and scaffold engineering alone cannot repair them. Until reasoning itself becomes a training target, the scientific knowledge produced by such agents cannot be justified by the process that generated it.
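
The headline figures (41.4% of explained variance for the base model versus 1.5% for the scaffold) come from decomposing run-level outcomes by model and scaffold. Below is a minimal sketch of one way such a decomposition can be computed, a two-way ANOVA over per-run scores using statsmodels; the factor names, effect sizes, and data are illustrative assumptions, not the paper's actual setup.

```python
# Minimal sketch (illustrative, not the paper's code): share of run-level
# score variance attributable to base model vs. agent scaffold, via a
# two-way ANOVA. All data below is synthetic.
import numpy as np
import pandas as pd
import statsmodels.api as sm
from statsmodels.formula.api import ols

rng = np.random.default_rng(0)
n = 1000  # hypothetical number of agent runs

models = rng.choice(["model_a", "model_b", "model_c"], size=n)
scaffolds = rng.choice(["react", "plan_and_act"], size=n)

# Simulate a setting where the base model matters far more than the scaffold.
model_effect = {"model_a": 0.2, "model_b": 0.5, "model_c": 0.8}
scaffold_effect = {"react": 0.0, "plan_and_act": 0.05}
score = (np.array([model_effect[m] for m in models])
         + np.array([scaffold_effect[s] for s in scaffolds])
         + rng.normal(0.0, 0.15, size=n))

runs = pd.DataFrame({"model": models, "scaffold": scaffolds, "score": score})

# Fit an additive model and decompose the sums of squares (type II ANOVA).
fit = ols("score ~ C(model) + C(scaffold)", data=runs).fit()
anova = sm.stats.anova_lm(fit, typ=2)

# Fraction of total variance explained by each factor (Residual = unexplained).
print(anova["sum_sq"] / anova["sum_sq"].sum())
```

With this synthetic data, the model factor captures most of the explained variance and the scaffold factor very little, mirroring the qualitative pattern the abstract reports.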