Reflective Planning: Vision-Language Models for Multi-Stage Long-Horizon Robotic Manipulation

Abstract

Solving complex long-horizon robotic manipulation problems requires sophisticated high-level planning, the ability to reason about the physical world, and the ability to reactively choose appropriate motor skills. Vision-language models (VLMs) pretrained on Internet data could in principle offer a framework for tackling such problems. However, in their current form, VLMs lack both the nuanced understanding of intricate physics required for robotic manipulation and the ability to reason over long horizons to address compounding errors. In this paper, we introduce a novel test-time computation framework that enhances VLMs' physical reasoning capabilities for multi-stage manipulation tasks. At its core, our approach iteratively improves a pretrained VLM with a "reflection" mechanism: it uses a generative model to imagine future world states, leverages these predictions to guide action selection, and critically reflects on potential suboptimalities to refine its reasoning. Experimental results demonstrate that our method significantly outperforms several state-of-the-art commercial VLMs as well as other post-training approaches such as Monte Carlo Tree Search (MCTS). Videos are available at https://reflect-vlm.github.io.
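
The propose, imagine, reflect loop described in the abstract can be illustrated with a minimal toy sketch. This is not the authors' implementation: the task (inserting three pieces with ordering constraints), the names `REQUIRES`, `propose`, `imagine`, `reflect`, and `reflective_step`, and the `horizon` parameter are all hypothetical stand-ins. In the paper the policy is a VLM and the predictor is a generative model; here both are replaced by simple deterministic functions so the control flow is easy to follow.

```python
from typing import FrozenSet

State = FrozenSet[str]

# Hypothetical toy task: each piece maps to the pieces that must be inserted first.
REQUIRES = {"a": set(), "b": {"a"}, "c": {"b"}}


def propose(state: State) -> str:
    """Stand-in for the VLM policy: naively prefers 'c', then 'a', then 'b'."""
    return next(p for p in ("c", "a", "b") if p not in state)


def imagine(state: State, action: str) -> State:
    """Stand-in for the generative model: predicts the state after one action."""
    if REQUIRES[action] <= state:
        return state | {action}
    return state  # a blocked insertion leaves the state unchanged


def reflect(state: State, imagined: State, proposal: str) -> str:
    """Keep the proposal if the imagined future shows progress, else revise it."""
    if imagined != state:
        return proposal
    # Revision rule (an assumption of this toy): pick any currently feasible piece.
    return next(p for p in sorted(REQUIRES)
                if p not in state and REQUIRES[p] <= state)


def reflective_step(state: State, horizon: int = 2) -> str:
    """Propose an action, roll the imagined state forward, then reflect."""
    proposal = propose(state)
    imagined, action = state, proposal
    for _ in range(horizon):
        imagined = imagine(imagined, action)
        if len(imagined) == len(REQUIRES):
            break  # imagined task is complete, nothing left to propose
        action = propose(imagined)
    return reflect(state, imagined, proposal)


def run() -> list:
    """Run the loop until every piece is inserted and return the executed plan."""
    state: State = frozenset()
    plan = []
    while len(state) < len(REQUIRES):
        action = reflective_step(state)
        state = imagine(state, action)  # toy: execution matches the prediction
        plan.append(action)
    return plan


print(run())  # ['a', 'b', 'c']
```

In this toy, the naive policy keeps proposing the blocked piece `c`; the imagined rollout shows no progress, so the reflection step replaces the proposal with a feasible piece. The real method would condition a learned model on the imagined future rather than apply a hand-written rule.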
