iVideoGPT: Interactive VideoGPTs are Scalable World Models
May 24, 2024
Authors: Jialong Wu, Shaofeng Yin, Ningya Feng, Xu He, Dong Li, Jianye Hao, Mingsheng Long
cs.AI
Abstract
World models empower model-based agents to interactively explore, reason, and
plan within imagined environments for real-world decision-making. However, the
high demand for interactivity poses challenges in harnessing recent
advancements in video generative models for developing world models at scale.
This work introduces Interactive VideoGPT (iVideoGPT), a scalable
autoregressive transformer framework that integrates multimodal signals--visual
observations, actions, and rewards--into a sequence of tokens, facilitating an
interactive experience of agents via next-token prediction. iVideoGPT features
a novel compressive tokenization technique that efficiently discretizes
high-dimensional visual observations. Leveraging its scalable architecture, we
are able to pre-train iVideoGPT on millions of human and robotic manipulation
trajectories, establishing a versatile foundation that is adaptable to serve as
interactive world models for a wide range of downstream tasks. These include
action-conditioned video prediction, visual planning, and model-based
reinforcement learning, where iVideoGPT achieves competitive performance
compared with state-of-the-art methods. Our work advances the development of
interactive general world models, bridging the gap between generative video
models and practical model-based reinforcement learning applications.
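To make the core idea concrete, here is a minimal sketch of how a trajectory of multimodal signals might be flattened into a single token sequence for next-token prediction. The exact token layout, separator tokens, and per-frame token count below are illustrative assumptions, not the paper's actual specification:

```python
# Minimal sketch (assumption: the token layout, special tokens, and
# tokens-per-frame count here are illustrative, not iVideoGPT's spec).
# The idea: flatten each step's (observation tokens, action, reward)
# into one sequence so a decoder-only transformer can roll the
# environment forward via next-token prediction.

OBS_TOKENS_PER_FRAME = 4   # hypothetical compressive-tokenizer output size
SEP = "<sep>"              # hypothetical step-separator token

def flatten_trajectory(steps):
    """Interleave (obs_tokens, action, reward) steps into one token list."""
    seq = []
    for obs_tokens, action, reward in steps:
        assert len(obs_tokens) == OBS_TOKENS_PER_FRAME
        seq.extend(obs_tokens)          # discretized visual observation
        seq.append(f"<act:{action}>")   # action-conditioning token
        seq.append(f"<rew:{reward}>")   # reward token for model-based RL
        seq.append(SEP)                 # marks the end of one step
    return seq

# Two interaction steps with dummy observation token ids:
steps = [
    ([11, 12, 13, 14], 2, 0.0),
    ([15, 16, 17, 18], 1, 1.0),
]
seq = flatten_trajectory(steps)
# Training objective: predict seq[t+1] from seq[:t+1]. At interaction
# time, the agent appends its action token and the model generates the
# next frame's observation tokens and reward token.
```

This framing is what makes the model "interactive": because actions sit inline in the sequence, the agent can intervene at every step rather than conditioning a whole video clip on a fixed prompt.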