NewtonBench: Benchmarking Generalizable Scientific Law Discovery in LLM Agents
October 8, 2025
Authors: Tianshi Zheng, Kelvin Kiu-Wai Tam, Newt Hue-Nam K. Nguyen, Baixuan Xu, Zhaowei Wang, Jiayang Cheng, Hong Ting Tsang, Weiqi Wang, Jiaxin Bai, Tianqing Fang, Yangqiu Song, Ginny Y. Wong, Simon See
cs.AI
Abstract
Large language models are emerging as powerful tools for scientific law
discovery, a foundational challenge in AI-driven science. However, existing
benchmarks for this task suffer from a fundamental methodological trilemma,
forcing a trade-off between scientific relevance, scalability, and resistance
to memorization. Furthermore, they oversimplify discovery as static function
fitting, failing to capture the authentic scientific process of uncovering
embedded laws through the interactive exploration of complex model systems. To
address these critical gaps, we introduce NewtonBench, a benchmark comprising
324 scientific law discovery tasks across 12 physics domains. Our design
mitigates the evaluation trilemma by using metaphysical shifts - systematic
alterations of canonical laws - to generate a vast suite of problems that are
scalable, scientifically relevant, and memorization-resistant. Moreover, we
elevate the evaluation from static function fitting to interactive model
discovery, requiring agents to experimentally probe simulated complex systems
to uncover hidden principles. Our extensive experiments reveal a clear but
fragile capability for discovery in frontier LLMs: this ability degrades
precipitously with increasing system complexity and exhibits extreme
sensitivity to observational noise. Notably, we uncover a paradoxical effect of
tool assistance: providing a code interpreter can hinder more capable models by
inducing a premature shift from exploration to exploitation, causing them to
satisfice on suboptimal solutions. These results demonstrate that robust,
generalizable discovery in complex, interactive environments remains the core
challenge. By providing a scalable, robust, and scientifically authentic
testbed, NewtonBench offers a crucial tool for measuring true progress and
guiding the development of next-generation AI agents capable of genuine
scientific discovery.
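
For intuition only, here is a minimal Python sketch of what a "metaphysical shift" and interactive probing might look like. The function name, units, and the altered exponent are assumptions for illustration and are not taken from the benchmark itself.

```python
import math

# Hypothetical illustration (not from the paper): a "metaphysically shifted"
# variant of Newton's law of gravitation, where the canonical inverse-square
# exponent is altered so that memorized textbook formulas no longer apply.
G = 6.674e-11            # gravitational constant (SI units)
SHIFTED_EXPONENT = 2.5   # assumed alteration; the benchmark's actual shifts differ

def shifted_gravity(m1: float, m2: float, r: float) -> float:
    """Force under the altered law F = G * m1 * m2 / r**SHIFTED_EXPONENT."""
    return G * m1 * m2 / r ** SHIFTED_EXPONENT

# An agent interacting with such a simulated system would only observe
# input/output pairs, e.g. by probing chosen masses and separations,
# and would have to infer the hidden functional form from the responses.
for r in (1.0, 2.0, 4.0):
    print(r, shifted_gravity(5.0e3, 1.0e4, r))
```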