Reflective Planning: Vision-Language Models for Multi-Stage Long-Horizon Robotic Manipulation

Abstract

Solving complex long-horizon robotic manipulation problems requires sophisticated high-level planning, the ability to reason about the physical world, and the ability to reactively choose appropriate motor skills. Vision-language models (VLMs) pretrained on Internet data could in principle offer a framework for tackling such problems. In their current form, however, VLMs lack both the nuanced understanding of intricate physics that robotic manipulation requires and the ability to reason over long horizons, which is needed to address compounding errors. In this paper, we introduce a novel test-time computation framework that enhances VLMs' physical reasoning capabilities for multi-stage manipulation tasks. At its core, our approach iteratively improves a pretrained VLM with a "reflection" mechanism: it uses a generative model to imagine future world states, leverages these predictions to guide action selection, and critically reflects on potential suboptimalities to refine its reasoning. Experimental results demonstrate that our method significantly outperforms several state-of-the-art commercial VLMs as well as other post-training approaches such as Monte Carlo Tree Search (MCTS). Videos are available at https://reflect-vlm.github.io.
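
The propose, imagine, and reflect loop described in the abstract can be illustrated with a toy sketch. This is not the authors' implementation: the integer state, the `GOAL` constant, and the functions `propose`, `imagine`, `reflect`, and `reflective_step` are all hypothetical stand-ins for the VLM policy, the generative model of future states, and the reflection step, chosen only so that the control flow is runnable.

```python
from typing import Callable

State = int   # toy stand-in for a scene observation (assumption)
Action = int  # toy stand-in for a motor skill (assumption)


def reflective_step(
    state: State,
    propose: Callable[[State], Action],
    imagine: Callable[[State, Action], State],
    reflect: Callable[[State, Action, State], Action],
    horizon: int = 2,
) -> Action:
    """Propose an action, imagine `horizon` steps ahead, then reflect on it."""
    first_action = propose(state)
    # Roll the initial proposal forward in imagination, not in the real world.
    imagined = imagine(state, first_action)
    for _ in range(horizon - 1):
        imagined = imagine(imagined, propose(imagined))
    # Keep or revise the first action in light of the imagined outcome.
    return reflect(state, first_action, imagined)


GOAL = 4  # hypothetical target position on a number line


def propose(state: State) -> Action:
    # Greedy policy: always take the big step until the goal is reached or passed.
    return 3 if state < GOAL else 0


def imagine(state: State, action: Action) -> State:
    # Toy dynamics model; a real system would use a learned generative model.
    return state + action


def reflect(state: State, action: Action, imagined: State) -> Action:
    # Revise to a smaller step if the imagined future overshoots the goal.
    return 1 if imagined > GOAL else action


# Run the loop until the goal is reached, executing one action per step.
state, plan = 0, []
while state != GOAL:
    action = reflective_step(state, propose, imagine, reflect)
    plan.append(action)
    state = imagine(state, action)
```

In this toy setting the greedy proposal alone would jump from 0 to 3 and then overshoot to 6; looking two steps ahead exposes the overshoot, so the first action is revised before anything is executed. The point is only the ordering of the three calls, under the stated assumptions.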
