AI scientists produce results without reasoning scientifically

April 20, 2026
Authors: Martiño Ríos-García, Nawaf Alampara, Chandan Gupta, Indrajeet Mandal, Sajid Mannan, Ali Asghar Aghajani, N. M. Anoop Krishnan, Kevin Maik Jablonka
cs.AI

Abstract

Large language model (LLM)-based systems are increasingly deployed to conduct scientific research autonomously, yet whether their reasoning adheres to the epistemic norms that make scientific inquiry self-correcting is poorly understood. Here, we evaluate LLM-based scientific agents across eight domains, ranging from workflow execution to hypothesis-driven inquiry, through more than 25,000 agent runs and two complementary lenses: (i) a systematic performance analysis that decomposes the contributions of the base model and the agent scaffold, and (ii) a behavioral analysis of the epistemological structure of agent reasoning. We observe that the base model is the primary determinant of both performance and behavior, accounting for 41.4% of explained variance versus 1.5% for the scaffold. Across all configurations, evidence is ignored in 68% of traces, refutation-driven belief revision occurs in 26%, and convergent multi-test evidence is rare. The same reasoning patterns appear whether the agent executes a computational workflow or conducts hypothesis-driven inquiry, and they persist even when agents receive near-complete successful reasoning trajectories as context; the resulting unreliability compounds across repeated trials in epistemically demanding domains. Thus, current LLM-based agents execute scientific workflows but do not exhibit the epistemic patterns that characterize scientific reasoning. Outcome-based evaluation cannot detect these failures, and scaffold engineering alone cannot repair them. Until reasoning itself becomes a training target, the scientific knowledge produced by such agents cannot be justified by the process that generated it.
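
The headline figures (41.4% of explained variance for the base model versus 1.5% for the scaffold) come from decomposing run-level outcomes by model and scaffold. Below is a minimal sketch of one way such a decomposition can be computed, a two-way ANOVA over per-run scores using statsmodels; the factor names, effect sizes, and data are illustrative assumptions, not the paper's actual setup.

```python
# Minimal sketch (illustrative, not the paper's code): share of run-level
# score variance attributable to base model vs. agent scaffold, via a
# two-way ANOVA. All data below is synthetic.
import numpy as np
import pandas as pd
import statsmodels.api as sm
from statsmodels.formula.api import ols

rng = np.random.default_rng(0)
n = 1000  # hypothetical number of agent runs

models = rng.choice(["model_a", "model_b", "model_c"], size=n)
scaffolds = rng.choice(["react", "plan_and_act"], size=n)

# Simulate a setting where the base model matters far more than the scaffold.
model_effect = {"model_a": 0.2, "model_b": 0.5, "model_c": 0.8}
scaffold_effect = {"react": 0.0, "plan_and_act": 0.05}
score = (np.array([model_effect[m] for m in models])
         + np.array([scaffold_effect[s] for s in scaffolds])
         + rng.normal(0.0, 0.15, size=n))

runs = pd.DataFrame({"model": models, "scaffold": scaffolds, "score": score})

# Fit an additive model and decompose the sums of squares (type II ANOVA).
fit = ols("score ~ C(model) + C(scaffold)", data=runs).fit()
anova = sm.stats.anova_lm(fit, typ=2)

# Fraction of total variance explained by each factor (Residual = unexplained).
print(anova["sum_sq"] / anova["sum_sq"].sum())
```

With this synthetic data, the model factor captures most of the explained variance and the scaffold factor very little, mirroring the qualitative pattern the abstract reports.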