Physically Grounded Vision-Language Models for Robotic Manipulation
September 5, 2023
Authors: Jensen Gao, Bidipta Sarkar, Fei Xia, Ted Xiao, Jiajun Wu, Brian Ichter, Anirudha Majumdar, Dorsa Sadigh
cs.AI
Abstract
Recent advances in vision-language models (VLMs) have led to improved
performance on tasks such as visual question answering and image captioning.
Consequently, these models are now well-positioned to reason about the physical
world, particularly within domains such as robotic manipulation. However,
current VLMs are limited in their understanding of the physical concepts (e.g.,
material, fragility) of common objects, which restricts their usefulness for
robotic manipulation tasks that involve interaction and physical reasoning
about such objects. To address this limitation, we propose PhysObjects, an
object-centric dataset of 36.9K crowd-sourced and 417K automated physical
concept annotations of common household objects. We demonstrate that
fine-tuning a VLM on PhysObjects improves its understanding of physical object
concepts, by capturing human priors of these concepts from visual appearance.
We incorporate this physically-grounded VLM in an interactive framework with a
large language model-based robotic planner, and show improved planning
performance on tasks that require reasoning about physical object concepts,
compared to baselines that do not leverage physically-grounded VLMs. We
additionally illustrate the benefits of our physically-grounded VLM on a real
robot, where it improves task success rates. We release our dataset and provide
further details and visualizations of our results at
https://iliad.stanford.edu/pg-vlm/.
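
A minimal sketch of the interaction pattern the abstract describes: an LLM-based planner queries a physically grounded VLM about object properties before committing to a plan. The function names, prompts, and return values below are hypothetical placeholders for illustration, not the authors' actual interface or the released PhysObjects tooling.

```python
# Hypothetical sketch only: stub functions stand in for a physically
# grounded VLM (e.g., one fine-tuned on PhysObjects) and an LLM planner.
from typing import List


def vlm_query(image_path: str, question: str) -> str:
    """Placeholder for a physically grounded VLM. A real system would run
    VLM inference on the image here; this stub returns a canned answer."""
    return "yes"  # e.g., answering "Is this object fragile?"


def llm_plan(task: str, object_facts: List[str]) -> List[str]:
    """Placeholder for an LLM-based robotic planner. A real system would
    prompt an LLM with the task and the VLM's answers; this stub returns
    a fixed plan for illustration."""
    return ["pick up the glass cup gently", "place the glass cup on the shelf"]


if __name__ == "__main__":
    task = "put the cup away without breaking anything"
    objects = ["glass cup"]

    # Ground the plan in physical concepts: ask the VLM about each object.
    facts = []
    for obj in objects:
        answer = vlm_query("scene.jpg", f"Is the {obj} fragile?")
        facts.append(f"{obj}: fragile={answer}")

    # The planner conditions on these physically grounded answers.
    for step in llm_plan(task, facts):
        print(step)
```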