Video Language Planning

Abstract

We are interested in enabling visual planning for complex long-horizon tasks in the space of generated videos and language, leveraging recent advances in large generative models pretrained on Internet-scale data. To this end, we present video language planning (VLP), an algorithm built around a tree search procedure in which we train (i) vision-language models to serve as both policies and value functions, and (ii) text-to-video models to serve as dynamics models. VLP takes as input a long-horizon task instruction and the current image observation, and outputs a long video plan: a detailed multimodal (video and language) specification of how to complete the final task. VLP scales with computation budget, in that more computation time yields better video plans, and it can synthesize long-horizon video plans across different robotics domains, from multi-object rearrangement to multi-camera bi-arm dexterous manipulation. Generated video plans can be translated into real robot actions via goal-conditioned policies, conditioned on each intermediate frame of the generated video. Experiments show that VLP substantially improves long-horizon task success rates compared to prior methods on both simulated and real robots (across 3 hardware platforms).
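The tree search described above can be illustrated with a minimal sketch. The abstract only states that VLP is a tree search using a policy, a value function, and a dynamics model, so everything concrete here is an assumption made for illustration: the beam-style pruning, the names `plan_search`, `Plan`, `toy_policy`, `toy_dynamics`, and `toy_value`, and the parameters `horizon`, `branching`, and `beam`. The toy stand-ins replace images with integers and replace the learned models with hand-written functions; they are not the paper's models.

```python
from dataclasses import dataclass, field
from typing import Callable, List

Frame = int  # toy stand-in for an image observation (assumption)


@dataclass
class Plan:
    frames: List[Frame]                             # generated "video" so far
    steps: List[str] = field(default_factory=list)  # language sub-goals so far
    value: float = 0.0                              # value estimate of the last frame


def plan_search(
    task: str,
    obs: Frame,
    policy: Callable[[str, Frame], List[str]],      # role of the vision-language policy
    dynamics: Callable[[Frame, str], List[Frame]],  # role of the text-to-video model
    value_fn: Callable[[str, Frame], float],        # role of the vision-language value function
    horizon: int = 3,
    branching: int = 3,
    beam: int = 2,
) -> Plan:
    """Beam-style tree search over (video, language) plans. Hypothetical sketch."""
    beams = [Plan(frames=[obs], value=value_fn(task, obs))]
    for _ in range(horizon):
        candidates: List[Plan] = []
        for plan in beams:
            last = plan.frames[-1]
            # Policy proposes language sub-goals from the latest frame.
            for subgoal in policy(task, last)[:branching]:
                # Dynamics model "renders" a short clip for that sub-goal.
                clip = dynamics(last, subgoal)
                candidates.append(
                    Plan(
                        frames=plan.frames + clip,
                        steps=plan.steps + [subgoal],
                        value=value_fn(task, clip[-1]),
                    )
                )
        if not candidates:
            break
        # Keep only the highest-value partial plans.
        candidates.sort(key=lambda p: p.value, reverse=True)
        beams = candidates[:beam]
    return max(beams, key=lambda p: p.value)


# Toy stand-ins (assumptions): the task is "reach N" on the integer line.
def toy_policy(task: str, frame: Frame) -> List[str]:
    return ["+1", "+2", "-1"]


def toy_dynamics(frame: Frame, subgoal: str) -> List[Frame]:
    delta = int(subgoal)
    step = 1 if delta > 0 else -1
    return [frame + step * i for i in range(1, abs(delta) + 1)]


def toy_value(task: str, frame: Frame) -> float:
    target = int(task.split()[-1])
    return -abs(target - frame)


if __name__ == "__main__":
    best = plan_search("reach 5", 0, toy_policy, toy_dynamics, toy_value)
    print(best.steps, best.frames)
```

In this sketch, raising `horizon`, `branching`, or `beam` spends more computation on the search, which is one simple way to picture the abstract's claim that more computation time yields better plans. In a real system each frame of the returned plan would then serve as a goal for a goal-conditioned policy, a step this toy example does not model.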