iVideoGPT: Interactive VideoGPTs are Scalable World Models

Abstract

World models empower model-based agents to interactively explore, reason, and plan within imagined environments for real-world decision-making. However, the high demand for interactivity poses challenges in harnessing recent advancements in video generative models for developing world models at scale. This work introduces Interactive VideoGPT (iVideoGPT), a scalable autoregressive transformer framework that integrates multimodal signals--visual observations, actions, and rewards--into a sequence of tokens, facilitating an interactive experience of agents via next-token prediction. iVideoGPT features a novel compressive tokenization technique that efficiently discretizes high-dimensional visual observations. Leveraging its scalable architecture, we are able to pre-train iVideoGPT on millions of human and robotic manipulation trajectories, establishing a versatile foundation that is adaptable to serve as interactive world models for a wide range of downstream tasks. These include action-conditioned video prediction, visual planning, and model-based reinforcement learning, where iVideoGPT achieves competitive performance compared with state-of-the-art methods. Our work advances the development of interactive general world models, bridging the gap between generative video models and practical model-based reinforcement learning applications.

