PhysBench: Benchmarking and Enhancing Vision-Language Models for Physical World Understanding

Abstract

Understanding the physical world is a fundamental challenge in embodied AI, critical for enabling agents to perform complex tasks and operate safely in real-world environments. While Vision-Language Models (VLMs) have shown great promise in reasoning and task planning for embodied agents, their ability to comprehend physical phenomena remains extremely limited. To close this gap, we introduce PhysBench, a comprehensive benchmark designed to evaluate VLMs' physical world understanding capability across a diverse set of tasks. PhysBench contains 10,002 entries of interleaved video-image-text data, categorized into four major domains: physical object properties, physical object relationships, physical scene understanding, and physics-based dynamics. These are further divided into 19 subclasses and 8 distinct capability dimensions. Our extensive experiments, conducted on 75 representative VLMs, reveal that while these models excel in common-sense reasoning, they struggle with understanding the physical world, likely due to the absence of physical knowledge in their training data and the lack of embedded physical priors. To address this shortfall, we introduce PhysAgent, a novel framework that combines the generalization strengths of VLMs with the specialized expertise of vision models, significantly enhancing VLMs' physical understanding across a variety of tasks, including an 18.4% improvement on GPT-4o. Furthermore, our results demonstrate that enhancing VLMs' physical world understanding capabilities can help embodied agents such as MOKA. We believe that PhysBench and PhysAgent offer valuable insights and contribute to bridging the gap between VLMs and physical world understanding.
