PhysBench: Benchmarking and Enhancing Vision-Language Models for Physical World Understanding
January 27, 2025
Authors: Wei Chow, Jiageng Mao, Boyi Li, Daniel Seita, Vitor Guizilini, Yue Wang
cs.AI
Abstract
Understanding the physical world is a fundamental challenge in embodied AI,
critical for enabling agents to perform complex tasks and operate safely in
real-world environments. While Vision-Language Models (VLMs) have shown great
promise in reasoning and task planning for embodied agents, their ability to
comprehend physical phenomena remains extremely limited. To close this gap, we
introduce PhysBench, a comprehensive benchmark designed to evaluate VLMs'
physical world understanding capability across a diverse set of tasks.
PhysBench contains 10,002 entries of interleaved video-image-text data,
categorized into four major domains: physical object properties, physical
object relationships, physical scene understanding, and physics-based dynamics,
further divided into 19 subclasses and 8 distinct capability dimensions. Our
extensive experiments, conducted on 75 representative VLMs, reveal that while
these models excel in common-sense reasoning, they struggle with understanding
the physical world -- likely due to the absence of physical knowledge in their
training data and the lack of embedded physical priors. To address this
shortfall, we introduce PhysAgent, a novel framework that combines the
generalization strengths of VLMs with the specialized expertise of vision
models, significantly enhancing VLMs' physical understanding across a variety
of tasks, including an 18.4% improvement on GPT-4o. Furthermore, our results
demonstrate that enhancing VLMs' physical world understanding capabilities can
help embodied agents such as MOKA. We believe that PhysBench and PhysAgent
offer valuable insights and contribute to bridging the gap between VLMs and
physical world understanding.
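As a rough illustration of how a benchmark of this shape might be consumed, the sketch below shows a hypothetical entry schema and a per-domain accuracy computation in Python. The class name PhysBenchEntry, its fields, and the per_domain_accuracy helper are assumptions made for illustration only; they are not the actual PhysBench data format or evaluation code.

```python
from dataclasses import dataclass, field
from collections import defaultdict

# Hypothetical schema for one benchmark entry; field names are illustrative,
# not the actual PhysBench format.
@dataclass
class PhysBenchEntry:
    domain: str            # e.g. "object_properties", "object_relationships",
                           # "scene_understanding", "physics_based_dynamics"
    subclass: str          # one of the 19 finer-grained categories
    media: list = field(default_factory=list)  # interleaved video/image paths
    question: str = ""
    choices: list = field(default_factory=list)
    answer: str = ""       # ground-truth choice label, e.g. "A"

def per_domain_accuracy(entries, predict):
    """Compute accuracy per physical-understanding domain.

    `predict` is any callable mapping an entry to a choice label,
    e.g. a wrapper around a VLM's multiple-choice output.
    """
    correct, total = defaultdict(int), defaultdict(int)
    for e in entries:
        total[e.domain] += 1
        if predict(e) == e.answer:
            correct[e.domain] += 1
    return {d: correct[d] / total[d] for d in total}

# Toy usage with a trivial stand-in predictor.
if __name__ == "__main__":
    data = [
        PhysBenchEntry(domain="object_properties", subclass="mass",
                       question="Which object is heavier?",
                       choices=["A", "B"], answer="A"),
    ]
    print(per_domain_accuracy(data, predict=lambda e: "A"))
```

A real evaluation harness would additionally handle the interleaved video and image inputs and the prompt formatting required by each of the 75 evaluated VLMs; the sketch only conveys the domain/subclass bookkeeping implied by the abstract.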