

AI scientists produce results without reasoning scientifically

April 20, 2026
Authors: Martiño Ríos-García, Nawaf Alampara, Chandan Gupta, Indrajeet Mandal, Sajid Mannan, Ali Asghar Aghajani, N. M. Anoop Krishnan, Kevin Maik Jablonka
cs.AI

Abstract

Large language model (LLM)-based systems are increasingly deployed to conduct scientific research autonomously, yet whether their reasoning adheres to the epistemic norms that make scientific inquiry self-correcting is poorly understood. Here, we evaluate LLM-based scientific agents across eight domains, spanning workflow execution to hypothesis-driven inquiry, through more than 25,000 agent runs and two complementary lenses: (i) a systematic performance analysis that decomposes the contributions of the base model and the agent scaffold, and (ii) a behavioral analysis of the epistemological structure of agent reasoning. We observe that the base model is the primary determinant of both performance and behavior, accounting for 41.4% of explained variance versus 1.5% for the scaffold. Across all configurations, evidence is ignored in 68% of traces, refutation-driven belief revision occurs in 26%, and convergent multi-test evidence is rare. The same reasoning patterns appear whether the agent executes a computational workflow or conducts hypothesis-driven inquiry. They persist even when agents receive near-complete successful reasoning trajectories as context, and the resulting unreliability compounds across repeated trials in epistemically demanding domains. Thus, current LLM-based agents execute scientific workflows but do not exhibit the epistemic patterns that characterize scientific reasoning. Outcome-based evaluation cannot detect these failures, and scaffold engineering alone cannot repair them. Until reasoning itself becomes a training target, the scientific knowledge produced by such agents cannot be justified by the process that generated it.
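As a rough illustration of how explained variance might be decomposed between the base model and the agent scaffold across many agent runs, the sketch below fits a two-way ANOVA on run-level scores and reports each factor's share of the total sum of squares. The column names, example data, and eta-squared summary are assumptions for illustration only, not the paper's actual analysis pipeline.

```python
# Hypothetical sketch: partitioning variance in agent performance between
# the base model and the scaffold. Data and column names are illustrative.
import pandas as pd
import statsmodels.api as sm
from statsmodels.formula.api import ols

# One row per agent run: which base model, which scaffold, and the task score.
runs = pd.DataFrame({
    "model":    ["model-a", "model-a", "model-b", "model-b", "model-a", "model-b"],
    "scaffold": ["react",   "plan",    "react",   "plan",    "plan",    "react"],
    "score":    [0.62,      0.58,      0.41,      0.39,      0.60,      0.44],
})

# Two-way ANOVA: decompose score variance into model, scaffold, and residual terms.
fit = ols("score ~ C(model) + C(scaffold)", data=runs).fit()
anova = sm.stats.anova_lm(fit, typ=2)

# Eta squared: each factor's fraction of the total sum of squares
# (analogous to the "% of explained variance" figures quoted in the abstract).
eta_sq = anova["sum_sq"] / anova["sum_sq"].sum()
print(anova)
print(eta_sq)
```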