

NewtonBench: Benchmarking Generalizable Scientific Law Discovery in LLM Agents

October 8, 2025
Authors: Tianshi Zheng, Kelvin Kiu-Wai Tam, Newt Hue-Nam K. Nguyen, Baixuan Xu, Zhaowei Wang, Jiayang Cheng, Hong Ting Tsang, Weiqi Wang, Jiaxin Bai, Tianqing Fang, Yangqiu Song, Ginny Y. Wong, Simon See
cs.AI

Abstract

Large language models are emerging as powerful tools for scientific law discovery, a foundational challenge in AI-driven science. However, existing benchmarks for this task suffer from a fundamental methodological trilemma, forcing a trade-off between scientific relevance, scalability, and resistance to memorization. Furthermore, they oversimplify discovery as static function fitting, failing to capture the authentic scientific process of uncovering embedded laws through the interactive exploration of complex model systems. To address these critical gaps, we introduce NewtonBench, a benchmark comprising 324 scientific law discovery tasks across 12 physics domains. Our design mitigates the evaluation trilemma by using metaphysical shifts - systematic alterations of canonical laws - to generate a vast suite of problems that are scalable, scientifically relevant, and memorization-resistant. Moreover, we elevate the evaluation from static function fitting to interactive model discovery, requiring agents to experimentally probe simulated complex systems to uncover hidden principles. Our extensive experiments reveal a clear but fragile capability for discovery in frontier LLMs: this ability degrades precipitously with increasing system complexity and exhibits extreme sensitivity to observational noise. Notably, we uncover a paradoxical effect of tool assistance: providing a code interpreter can hinder more capable models by inducing a premature shift from exploration to exploitation, causing them to satisfice on suboptimal solutions. These results demonstrate that robust, generalizable discovery in complex, interactive environments remains the core challenge. By providing a scalable, robust, and scientifically authentic testbed, NewtonBench offers a crucial tool for measuring true progress and guiding the development of next-generation AI agents capable of genuine scientific discovery.
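To make the setup concrete, the sketch below illustrates the two core ideas from the abstract in miniature: a "metaphysically shifted" law (a canonical law with a hidden, systematic alteration, here an invented change to the distance exponent of Newtonian gravity) and interactive discovery, where the agent can only query the simulated system and must infer the hidden form from its own experiments. This is a minimal illustration under assumed names; the class, method, and parameter names are hypothetical and are not the actual NewtonBench API.

```python
# Illustrative sketch only: ShiftedGravitySystem, observe, and probe_exponent
# are hypothetical names, not the NewtonBench interface.
import math
import random


class ShiftedGravitySystem:
    """A toy 'metaphysically shifted' law: Newtonian gravity with the
    inverse-square exponent replaced by a hidden value (3 instead of 2).
    The agent never sees this code; it can only call observe()."""

    def __init__(self, G=6.674e-11, exponent=3.0, noise_std=0.0):
        self._G = G
        self._exponent = exponent      # the hidden alteration to be discovered
        self._noise_std = noise_std    # relative observational noise level

    def observe(self, m1, m2, r):
        """Run one 'experiment': return the measured force for chosen inputs."""
        force = self._G * m1 * m2 / r ** self._exponent
        return force + random.gauss(0.0, self._noise_std * force)


def probe_exponent(system, m1=1.0e3, m2=1.0e3, r1=1.0, r2=2.0):
    """A minimal discovery strategy: hold the masses fixed, vary only r,
    and infer the exponent k from F(r1) / F(r2) = (r2 / r1) ** k."""
    f1 = system.observe(m1, m2, r1)
    f2 = system.observe(m1, m2, r2)
    return math.log(f1 / f2) / math.log(r2 / r1)


if __name__ == "__main__":
    system = ShiftedGravitySystem(noise_std=0.01)
    estimate = probe_exponent(system)
    print(f"estimated distance exponent: {estimate:.2f}")  # close to 3.0
```

An actual benchmark task would involve richer systems and noisier observations than this two-measurement probe; as the abstract notes, it is exactly in that regime, with rising complexity and observational noise, that discovery performance degrades sharply.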