PhysGym: Benchmarking LLMs in Interactive Physics Discovery with Controlled Priors

Abstract

Evaluating the scientific discovery capabilities of large language model (LLM)-based agents, particularly how they cope with varying environmental complexity and make use of prior knowledge, requires specialized benchmarks that are currently lacking. To address this gap, we introduce PhysGym, a novel benchmark suite and simulation platform for rigorously assessing LLM-based scientific reasoning in interactive physics environments. PhysGym's primary contribution is its fine-grained control over the level of prior knowledge provided to the agent. This allows researchers to dissect agent performance along two axes: the complexity of the problem and the level of prior knowledge. The benchmark comprises a suite of interactive simulations in which agents must actively probe environments, gather data sequentially under constraints, and formulate hypotheses about the underlying physical laws. PhysGym provides standardized evaluation protocols and metrics for assessing hypothesis accuracy and model fidelity. We demonstrate the benchmark's utility by presenting results from baseline LLMs, showing that it can differentiate capabilities across varying priors and task complexity.
