Video Language Planning

Abstract

We are interested in enabling visual planning for complex long-horizon tasks in the space of generated videos and language, leveraging recent advances in large generative models pretrained on Internet-scale data. To this end, we present video language planning (VLP), an algorithm that consists of a tree search procedure, where we train (i) vision-language models to serve as both policies and value functions, and (ii) text-to-video models as dynamics models. VLP takes as input a long-horizon task instruction and the current image observation, and outputs a long video plan that provides detailed multimodal (video and language) specifications describing how to complete the final task. VLP scales with the computation budget, in that more computation time yields better video plans, and it can synthesize long-horizon video plans across different robotics domains, from multi-object rearrangement to multi-camera bi-arm dexterous manipulation. Generated video plans can be translated into real robot actions via goal-conditioned policies, conditioned on each intermediate frame of the generated video. Experiments show that VLP substantially improves long-horizon task success rates compared to prior methods on both simulated and real robots (across 3 hardware platforms).
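The tree search described in the abstract can be pictured with a small Python sketch. This is an illustration under stated assumptions, not the authors' implementation. The names `plan_search`, `toy_propose`, `toy_rollout`, and `toy_score` are hypothetical, the beam-style pruning is one plausible way to organize such a search, and integers stand in for video frames so that the example runs without any learned models. In a real system the three callables would be backed by a vision-language model (policy and value function) and a text-to-video model (dynamics).

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class Plan:
    frames: List[int]                                   # toy stand-in for video frames
    subgoals: List[str] = field(default_factory=list)   # language part of the plan


def plan_search(start: int, instruction: str,
                propose: Callable[[int, str, int], List[str]],
                rollout: Callable[[int, str], List[int]],
                score: Callable[[int, str], float],
                horizon: int = 3, branches: int = 2, beam: int = 2) -> Plan:
    """Beam-style tree search over (subgoal, clip) pairs. Assumed structure."""
    beams = [Plan(frames=[start])]
    for _ in range(horizon):
        candidates = []
        for plan in beams:
            last = plan.frames[-1]
            # Policy role: propose candidate next subgoals in language.
            for subgoal in propose(last, instruction, branches):
                # Dynamics role: imagine the frames that follow the subgoal.
                clip = rollout(last, subgoal)
                candidates.append(Plan(plan.frames + clip,
                                       plan.subgoals + [subgoal]))
        # Value role: keep the branches whose final frame looks closest to done.
        candidates.sort(key=lambda p: score(p.frames[-1], instruction),
                        reverse=True)
        beams = candidates[:beam]
    return beams[0]


# Toy stand-ins (hypothetical): the "task" is to reach a target integer.
def toy_propose(frame: int, instruction: str, k: int) -> List[str]:
    return ["+1", "+2"][:k]


def toy_rollout(frame: int, subgoal: str) -> List[int]:
    return [frame + int(subgoal)]


def toy_score(frame: int, instruction: str) -> float:
    target = int(instruction.split()[-1])
    return -abs(target - frame)


if __name__ == "__main__":
    best = plan_search(0, "reach 5", toy_propose, toy_rollout, toy_score)
    print(best.subgoals, best.frames)
```

Raising `branches`, `beam`, or `horizon` in this sketch spends more computation on the search, which mirrors the abstract's point that a larger computation budget can yield better plans. The final step in the abstract, turning each planned frame into robot actions with a goal-conditioned policy, is outside the scope of this sketch.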
