Reflective Planning: Vision-Language Models for Multi-Stage Long-Horizon Robotic Manipulation

Abstract

Solving complex long-horizon robotic manipulation problems requires sophisticated high-level planning capabilities, the ability to reason about the physical world, and the ability to reactively choose appropriate motor skills. Vision-language models (VLMs) pretrained on Internet data could in principle offer a framework for tackling such problems. However, in their current form, VLMs lack both the nuanced understanding of intricate physics required for robotic manipulation and the ability to reason over long horizons to address error compounding. In this paper, we introduce a novel test-time computation framework that enhances VLMs' physical reasoning capabilities for multi-stage manipulation tasks. At its core, our approach iteratively improves a pretrained VLM with a "reflection" mechanism: it uses a generative model to imagine future world states, leverages these predictions to guide action selection, and critically reflects on potential suboptimalities to refine its reasoning. Experimental results demonstrate that our method significantly outperforms several state-of-the-art commercial VLMs as well as other post-training approaches such as Monte Carlo Tree Search (MCTS). Videos are available at https://reflect-vlm.github.io.
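
The propose, imagine, and reflect steps described in the abstract can be illustrated with a toy sketch. This is not the paper's implementation: the 1-D integer state, the functions `propose`, `imagine`, and `reflect`, and the constants `GOAL` and `CANDIDATES` are all hypothetical stand-ins for the VLM policy, the generative model of future states, and the reflection step, chosen only to make the control flow concrete.

```python
from typing import List

GOAL = 3                 # hypothetical target position on a 1-D line
CANDIDATES = [-1, 1, 2]  # hypothetical discrete action set


def propose(state: int) -> int:
    """Stand-in for the base policy: a fixed guess that can overshoot."""
    return 2


def imagine(state: int, action: int) -> int:
    """Stand-in for the generative model: predicts the next state."""
    return state + action


def reflect(state: int, action: int, imagined: int) -> int:
    """Keep the proposed action unless the imagined future is farther
    from the goal than the current state; otherwise pick the candidate
    whose one-step imagined outcome is closest to the goal."""
    if abs(imagined - GOAL) <= abs(state - GOAL):
        return action
    return min(CANDIDATES, key=lambda a: abs(imagine(state, a) - GOAL))


def reflective_step(state: int, horizon: int = 2) -> int:
    """Propose an action, roll the model forward `horizon` steps, reflect."""
    action = propose(state)
    imagined = imagine(state, action)
    for _ in range(horizon - 1):
        imagined = imagine(imagined, propose(imagined))
    return reflect(state, action, imagined)


def run(start: int = 0, max_steps: int = 10) -> List[int]:
    """Execute reflective steps until the goal is reached or steps run out."""
    trajectory = [start]
    state = start
    for _ in range(max_steps):
        if state == GOAL:
            break
        state = state + reflective_step(state)  # toy environment: additive
        trajectory.append(state)
    return trajectory
```

In this toy setting, the base proposal is kept when the imagined rollout moves closer to the goal and is revised when the rollout overshoots. A real system would replace each stand-in with a learned model operating on images and language.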
