Physically Grounded Vision-Language Models for Robotic Manipulation

Abstract

Recent advances in vision-language models (VLMs) have led to improved performance on tasks such as visual question answering and image captioning. Consequently, these models are now well positioned to reason about the physical world, particularly within domains such as robotic manipulation. However, current VLMs are limited in their understanding of the physical concepts (e.g., material, fragility) of common objects, which restricts their usefulness for robotic manipulation tasks that involve interaction with and physical reasoning about such objects. To address this limitation, we propose PhysObjects, an object-centric dataset of 36.9K crowd-sourced and 417K automated physical concept annotations of common household objects. We demonstrate that fine-tuning a VLM on PhysObjects improves its understanding of physical object concepts by capturing human priors of these concepts from visual appearance. We incorporate this physically grounded VLM in an interactive framework with a large language model-based robotic planner, and show improved planning performance on tasks that require reasoning about physical object concepts, compared to baselines that do not leverage physically grounded VLMs. We additionally illustrate the benefits of our physically grounded VLM on a real robot, where it improves task success rates. We release our dataset and provide further details and visualizations of our results at https://iliad.stanford.edu/pg-vlm/.
