Reflective Planning: Vision-Language Models for Multi-Stage Long-Horizon Robotic Manipulation
February 23, 2025
Authors: Yunhai Feng, Jiaming Han, Zhuoran Yang, Xiangyu Yue, Sergey Levine, Jianlan Luo
cs.AI
Abstract
Solving complex long-horizon robotic manipulation problems requires
sophisticated high-level planning capabilities, the ability to reason about the
physical world, and the ability to reactively choose appropriate motor skills. Vision-language
models (VLMs) pretrained on Internet data could in principle offer a framework
for tackling such problems. However, in their current form, VLMs lack both the
nuanced understanding of intricate physics required for robotic manipulation
and the ability to reason over long horizons to address error compounding
issues. In this paper, we introduce a novel test-time computation framework
that enhances VLMs' physical reasoning capabilities for multi-stage
manipulation tasks. At its core, our approach iteratively improves a pretrained
VLM with a "reflection" mechanism - it uses a generative model to imagine
future world states, leverages these predictions to guide action selection, and
critically reflects on potential suboptimalities to refine its reasoning.
Experimental results demonstrate that our method significantly outperforms
several state-of-the-art commercial VLMs as well as other post-training
approaches such as Monte Carlo Tree Search (MCTS). Videos are available at
https://reflect-vlm.github.io.
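To make the described "reflection" mechanism concrete, the sketch below shows one plausible shape of such a planning loop in Python. It is an illustrative approximation only, assuming a high-level interface to the VLM, the generative dynamics model, and the robot; the names propose_action, imagine, reflect, execute, and is_complete are hypothetical placeholders, not the paper's actual API.

def reflective_plan(vlm, dynamics_model, robot, task, max_steps=50):
    """Hypothetical reflective planning loop: propose, imagine, reflect, act."""
    obs = robot.observe()
    for _ in range(max_steps):
        # 1. The VLM proposes a candidate high-level action for the task.
        action = vlm.propose_action(obs, task)

        # 2. A generative model "imagines" the future world state the action
        #    would lead to (e.g., a predicted image of the scene).
        imagined_obs = dynamics_model.imagine(obs, action)

        # 3. The VLM reflects on the imagined outcome and revises the action
        #    if it looks suboptimal for completing the overall task.
        action = vlm.reflect(obs, imagined_obs, action, task)

        # 4. Execute the (possibly revised) action with a low-level motor skill
        #    and observe the resulting state.
        obs = robot.execute(action)

        if task.is_complete(obs):
            break
    return obs

This sketch only conveys the abstract's high-level loop (predict future states, use them to guide action selection, and self-critique before acting); the paper's actual training and test-time procedure is described in the full text.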