Reflective Planning: Vision-Language Models for Multi-Stage Long-Horizon Robotic Manipulation
February 23, 2025
Authors: Yunhai Feng, Jiaming Han, Zhuoran Yang, Xiangyu Yue, Sergey Levine, Jianlan Luo
cs.AI
Abstract
Solving complex long-horizon robotic manipulation problems requires
sophisticated high-level planning capabilities, the ability to reason about the
physical world, and the ability to reactively choose appropriate motor skills. Vision-language
models (VLMs) pretrained on Internet data could in principle offer a framework
for tackling such problems. However, in their current form, VLMs lack both the
nuanced understanding of intricate physics required for robotic manipulation
and the ability to reason over long horizons to address error compounding
issues. In this paper, we introduce a novel test-time computation framework
that enhances VLMs' physical reasoning capabilities for multi-stage
manipulation tasks. At its core, our approach iteratively improves a pretrained
VLM with a "reflection" mechanism - it uses a generative model to imagine
future world states, leverages these predictions to guide action selection, and
critically reflects on potential suboptimalities to refine its reasoning.
Experimental results demonstrate that our method significantly outperforms
several state-of-the-art commercial VLMs as well as other post-training
approaches such as Monte Carlo Tree Search (MCTS). Videos are available at
https://reflect-vlm.github.io.
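To make the described "reflection" mechanism concrete, the sketch below shows one plausible shape of such a planning loop in Python. It is an illustrative approximation only, assuming a high-level interface to the VLM, the generative dynamics model, and the robot; the names propose_action, imagine, reflect, execute, and is_complete are hypothetical placeholders, not the paper's actual API.

def reflective_plan(vlm, dynamics_model, robot, task, max_steps=50):
    """Hypothetical reflective planning loop: propose, imagine, reflect, act."""
    obs = robot.observe()
    for _ in range(max_steps):
        # 1. The VLM proposes a candidate high-level action for the task.
        action = vlm.propose_action(obs, task)

        # 2. A generative model "imagines" the future world state the action
        #    would lead to (e.g., a predicted image of the scene).
        imagined_obs = dynamics_model.imagine(obs, action)

        # 3. The VLM reflects on the imagined outcome and revises the action
        #    if it looks suboptimal for completing the overall task.
        action = vlm.reflect(obs, imagined_obs, action, task)

        # 4. Execute the (possibly revised) action with a low-level motor skill
        #    and observe the resulting state.
        obs = robot.execute(action)

        if task.is_complete(obs):
            break
    return obs

This sketch only conveys the abstract's high-level loop (predict future states, use them to guide action selection, and self-critique before acting); the paper's actual training and test-time procedure is described in the full text.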