NewtonBench: Benchmarking Generalizable Scientific Law Discovery in LLM Agents
October 8, 2025
Authors: Tianshi Zheng, Kelvin Kiu-Wai Tam, Newt Hue-Nam K. Nguyen, Baixuan Xu, Zhaowei Wang, Jiayang Cheng, Hong Ting Tsang, Weiqi Wang, Jiaxin Bai, Tianqing Fang, Yangqiu Song, Ginny Y. Wong, Simon See
cs.AI
Abstract
Large language models are emerging as powerful tools for scientific law
discovery, a foundational challenge in AI-driven science. However, existing
benchmarks for this task suffer from a fundamental methodological trilemma,
forcing a trade-off between scientific relevance, scalability, and resistance
to memorization. Furthermore, they oversimplify discovery as static function
fitting, failing to capture the authentic scientific process of uncovering
embedded laws through the interactive exploration of complex model systems. To
address these critical gaps, we introduce NewtonBench, a benchmark comprising
324 scientific law discovery tasks across 12 physics domains. Our design
mitigates the evaluation trilemma by using metaphysical shifts - systematic
alterations of canonical laws - to generate a vast suite of problems that are
scalable, scientifically relevant, and memorization-resistant. Moreover, we
elevate the evaluation from static function fitting to interactive model
discovery, requiring agents to experimentally probe simulated complex systems
to uncover hidden principles. Our extensive experiments reveal a clear but
fragile capability for discovery in frontier LLMs: this ability degrades
precipitously with increasing system complexity and exhibits extreme
sensitivity to observational noise. Notably, we uncover a paradoxical effect of
tool assistance: providing a code interpreter can hinder more capable models by
inducing a premature shift from exploration to exploitation, causing them to
satisfice on suboptimal solutions. These results demonstrate that robust,
generalizable discovery in complex, interactive environments remains the core
challenge. By providing a scalable, robust, and scientifically authentic
testbed, NewtonBench offers a crucial tool for measuring true progress and
guiding the development of next-generation AI agents capable of genuine
scientific discovery.
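
For intuition only, here is a minimal Python sketch of what a "metaphysical shift" and interactive probing might look like. The function name, units, and the altered exponent are assumptions for illustration and are not taken from the benchmark itself.

```python
import math

# Hypothetical illustration (not from the paper): a "metaphysically shifted"
# variant of Newton's law of gravitation, where the canonical inverse-square
# exponent is altered so that memorized textbook formulas no longer apply.
G = 6.674e-11            # gravitational constant (SI units)
SHIFTED_EXPONENT = 2.5   # assumed alteration; the benchmark's actual shifts differ

def shifted_gravity(m1: float, m2: float, r: float) -> float:
    """Force under the altered law F = G * m1 * m2 / r**SHIFTED_EXPONENT."""
    return G * m1 * m2 / r ** SHIFTED_EXPONENT

# An agent interacting with such a simulated system would only observe
# input/output pairs, e.g. by probing chosen masses and separations,
# and would have to infer the hidden functional form from the responses.
for r in (1.0, 2.0, 4.0):
    print(r, shifted_gravity(5.0e3, 1.0e4, r))
```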