Physically Grounded Vision-Language Models for Robotic Manipulation
September 5, 2023
Authors: Jensen Gao, Bidipta Sarkar, Fei Xia, Ted Xiao, Jiajun Wu, Brian Ichter, Anirudha Majumdar, Dorsa Sadigh
cs.AI
Abstract
Recent advances in vision-language models (VLMs) have led to improved performance on tasks such as visual question answering and image captioning. Consequently, these models are now well-positioned to reason about the physical world, particularly within domains such as robotic manipulation. However, current VLMs are limited in their understanding of the physical concepts (e.g., material, fragility) of common objects, which restricts their usefulness for robotic manipulation tasks that involve interaction and physical reasoning about such objects. To address this limitation, we propose PhysObjects, an object-centric dataset of 36.9K crowd-sourced and 417K automated physical concept annotations of common household objects. We demonstrate that fine-tuning a VLM on PhysObjects improves its understanding of physical object concepts, by capturing human priors of these concepts from visual appearance. We incorporate this physically-grounded VLM in an interactive framework with a large language model-based robotic planner, and show improved planning performance on tasks that require reasoning about physical object concepts, compared to baselines that do not leverage physically-grounded VLMs. We additionally illustrate the benefits of our physically-grounded VLM on a real robot, where it improves task success rates. We release our dataset and provide further details and visualizations of our results at https://iliad.stanford.edu/pg-vlm/.
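The abstract does not include implementation details, but the interaction pattern it describes (an LLM-based planner querying a physically grounded VLM about object properties before committing to a plan) can be sketched roughly as follows. This is a minimal illustration using hypothetical stub functions (query_vlm, plan_with_llm) and canned answers; it is not the authors' released code or API.

```python
# Hypothetical sketch of the interactive framework described in the abstract:
# an LLM-based planner asks a physically grounded VLM about physical concepts
# (e.g., material, fragility) of objects in the scene, then conditions its plan
# on the answers. All names and data here are illustrative stubs.

from typing import Dict, List

PHYSICAL_CONCEPTS = ["material", "fragility"]


def query_vlm(image_path: str, obj: str, concept: str) -> str:
    """Stub for a fine-tuned VLM answering a physical-concept question
    about a single object in the image (e.g., 'How fragile is the mug?')."""
    canned = {
        ("mug", "material"): "ceramic", ("mug", "fragility"): "fragile",
        ("sponge", "material"): "foam", ("sponge", "fragility"): "durable",
    }
    return canned.get((obj, concept), "unknown")


def plan_with_llm(task: str, object_facts: Dict[str, Dict[str, str]]) -> List[str]:
    """Stub for an LLM planner that uses the VLM's answers to adapt its steps."""
    steps = []
    for obj, facts in object_facts.items():
        if facts.get("fragility") == "fragile":
            steps.append(f"pick up the {obj} gently and place it in the padded bin")
        else:
            steps.append(f"pick up the {obj} and place it in the bin")
    return steps


if __name__ == "__main__":
    scene_objects = ["mug", "sponge"]
    facts = {
        obj: {c: query_vlm("scene.jpg", obj, c) for c in PHYSICAL_CONCEPTS}
        for obj in scene_objects
    }
    for step in plan_with_llm("clear the table", facts):
        print(step)
```

In the paper's setting, the stubbed VLM call would be replaced by the PhysObjects-fine-tuned model and the stubbed planner by an actual LLM prompt; the sketch only shows how per-object physical-concept answers feed into plan selection.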