PhysGym: Benchmarking LLMs in Interactive Physics Discovery with Controlled Priors

July 21, 2025
Authors: Yimeng Chen, Piotr Piękos, Mateusz Ostaszewski, Firas Laakom, Jürgen Schmidhuber
cs.AI

Abstract

Evaluating the scientific-discovery capabilities of agents based on large language models (LLMs), particularly how they cope with varying environmental complexity and exploit prior knowledge, requires specialized benchmarks that are currently lacking. To address this gap, we introduce PhysGym, a novel benchmark suite and simulation platform for rigorously assessing LLM-based scientific reasoning in interactive physics environments. PhysGym's primary contribution is its fine-grained control over the level of prior knowledge provided to the agent, which lets researchers dissect agent performance along axes such as problem complexity and prior-knowledge level. The benchmark comprises a suite of interactive simulations in which agents must actively probe their environment, gather data sequentially under constraints, and formulate hypotheses about the underlying physical laws. PhysGym provides standardized evaluation protocols and metrics for assessing hypothesis accuracy and model fidelity. We demonstrate the benchmark's utility by presenting results from baseline LLMs, showing that it differentiates capabilities across varying priors and task complexity.
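
The abstract describes an interaction protocol: an agent sequentially queries a simulated environment under an experiment budget, then submits a hypothesized law that is scored against the hidden ground truth. The sketch below illustrates such a loop under stated assumptions; all names here (`PendulumEnv`, `probe`, `evaluate_hypothesis`, the relative-error score) are hypothetical illustrations, not the actual PhysGym API or its published metrics.

```python
# Minimal sketch of an interactive-discovery loop of the kind the abstract
# describes. Hypothetical stand-in, NOT the real PhysGym interface.
import numpy as np

class PendulumEnv:
    """Toy environment hiding a physical law (small-angle pendulum period
    T = 2*pi*sqrt(L/g)) and answering a limited number of experiment queries."""

    def __init__(self, g: float = 9.81, budget: int = 20):
        self.g = g
        self.budget = budget  # constraint on sequential data gathering

    def probe(self, length: float) -> float:
        """Run one experiment: observed period for a given pendulum length."""
        if self.budget <= 0:
            raise RuntimeError("experiment budget exhausted")
        self.budget -= 1
        return 2 * np.pi * np.sqrt(length / self.g)

def evaluate_hypothesis(truth, hypothesis, test_lengths) -> float:
    """Score model fidelity as mean relative error of the hypothesized law
    on held-out inputs (one plausible metric; the paper defines its own)."""
    errs = [abs(hypothesis(L) - truth(L)) / truth(L) for L in test_lengths]
    return float(np.mean(errs))

# An agent (e.g., an LLM) chooses probe points, fits a law, and submits it:
env = PendulumEnv()
data = [(L, env.probe(L)) for L in (0.5, 1.0, 2.0)]          # sequential probing
g_hat = np.mean([4 * np.pi**2 * L / T**2 for L, T in data])  # inferred constant
hypothesis = lambda L: 2 * np.pi * np.sqrt(L / g_hat)
truth = lambda L: 2 * np.pi * np.sqrt(L / 9.81)
print(evaluate_hypothesis(truth, hypothesis, test_lengths=(0.25, 1.5, 3.0)))
```

In PhysGym itself, the agent would be an LLM selecting experiments and emitting symbolic hypotheses, and the benchmark's protocols additionally control how much prior knowledge about the system (for instance, how much of the problem description is revealed) the agent receives before probing.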