AI scientists produce results without reasoning scientifically

Abstract

Large language model (LLM)-based systems are increasingly deployed to conduct scientific research autonomously, yet whether their reasoning adheres to the epistemic norms that make scientific inquiry self-correcting is poorly understood. Here, we evaluate LLM-based scientific agents across eight domains, spanning workflow execution to hypothesis-driven inquiry, through more than 25,000 agent runs and two complementary lenses: (i) a systematic performance analysis that decomposes the contributions of the base model and the agent scaffold, and (ii) a behavioral analysis of the epistemological structure of agent reasoning. We observe that the base model is the primary determinant of both performance and behavior, accounting for 41.4% of explained variance versus 1.5% for the scaffold. Across all configurations, evidence is ignored in 68% of traces, refutation-driven belief revision occurs in 26%, and convergent multi-test evidence is rare. The same reasoning pattern appears whether the agent executes a computational workflow or conducts hypothesis-driven inquiry. These patterns persist even when agents receive near-complete successful reasoning trajectories as context, and the resulting unreliability compounds across repeated trials in epistemically demanding domains. Thus, current LLM-based agents execute scientific workflows but do not exhibit the epistemic patterns that characterize scientific reasoning. Outcome-based evaluation cannot detect these failures, and scaffold engineering alone cannot repair them. Until reasoning itself becomes a training target, the scientific knowledge produced by such agents cannot be justified by the process that generated it.
