NewtonBench: Benchmarking Generalizable Scientific Law Discovery in LLM Agents

Abstract

Large language models are emerging as powerful tools for scientific law discovery, a foundational challenge in AI-driven science. However, existing benchmarks for this task suffer from a fundamental methodological trilemma, forcing a trade-off among scientific relevance, scalability, and resistance to memorization. Furthermore, they oversimplify discovery as static function fitting, failing to capture the authentic scientific process of uncovering embedded laws through the interactive exploration of complex model systems. To address these gaps, we introduce NewtonBench, a benchmark comprising 324 scientific law discovery tasks across 12 physics domains. Our design mitigates the evaluation trilemma by using metaphysical shifts (systematic alterations of canonical laws) to generate a vast suite of problems that are scalable, scientifically relevant, and memorization-resistant. Moreover, we elevate the evaluation from static function fitting to interactive model discovery, requiring agents to experimentally probe simulated complex systems to uncover hidden principles. Our extensive experiments reveal a clear but fragile capability for discovery in frontier LLMs: this ability degrades precipitously with increasing system complexity and is extremely sensitive to observational noise. Notably, we uncover a paradoxical effect of tool assistance: providing a code interpreter can hinder more capable models by inducing a premature shift from exploration to exploitation, causing them to satisfice on suboptimal solutions. These results demonstrate that robust, generalizable discovery in complex, interactive environments remains the core challenge. By providing a scalable, robust, and scientifically authentic testbed, NewtonBench offers a crucial tool for measuring true progress and guiding the development of next-generation AI agents capable of genuine scientific discovery.
