NewtonBench: Benchmarking Generalizable Scientific Law Discovery in LLM Agents

Abstract

Large language models are emerging as powerful tools for scientific law discovery, a foundational challenge in AI-driven science. However, existing benchmarks for this task suffer from a fundamental methodological trilemma, forcing a trade-off between scientific relevance, scalability, and resistance to memorization. Furthermore, they oversimplify discovery as static function fitting, failing to capture the authentic scientific process of uncovering embedded laws through the interactive exploration of complex model systems. To address these critical gaps, we introduce NewtonBench, a benchmark comprising 324 scientific law discovery tasks across 12 physics domains. Our design mitigates the evaluation trilemma by using metaphysical shifts (systematic alterations of canonical laws) to generate a vast suite of problems that are scalable, scientifically relevant, and memorization-resistant. Moreover, we elevate the evaluation from static function fitting to interactive model discovery, requiring agents to experimentally probe simulated complex systems to uncover hidden principles. Our extensive experiments reveal a clear but fragile capability for discovery in frontier LLMs: this ability degrades precipitously with increasing system complexity and is extremely sensitive to observational noise. Notably, we uncover a paradoxical effect of tool assistance: providing a code interpreter can hinder more capable models by inducing a premature shift from exploration to exploitation, causing them to satisfice on suboptimal solutions. These results demonstrate that robust, generalizable discovery in complex, interactive environments remains the core challenge. By providing a scalable, robust, and scientifically authentic testbed, NewtonBench offers a crucial tool for measuring true progress and guiding the development of next-generation AI agents capable of genuine scientific discovery.
