DeepPHY: Benchmarking Agentic VLMs on Physical Reasoning

Abstract

Although Vision Language Models (VLMs) exhibit strong perceptual abilities and impressive visual reasoning, they struggle with attention to detail and precise action planning in complex, dynamic environments, which leads to subpar performance. Real-world tasks typically require complex interactions, advanced spatial reasoning, long-term planning, and continuous strategy refinement, and usually demand an understanding of the physical rules of the target scenario. However, evaluating these capabilities in real-world scenarios is often prohibitively expensive. To bridge this gap, we introduce DeepPHY, a novel benchmark framework designed to systematically evaluate VLMs' understanding of and reasoning about fundamental physical principles through a series of challenging simulated environments. DeepPHY integrates multiple physical reasoning environments of varying difficulty levels and incorporates fine-grained evaluation metrics. Our evaluation finds that even state-of-the-art VLMs struggle to translate descriptive physical knowledge into precise, predictive control.

