

iVideoGPT: Interactive VideoGPTs are Scalable World Models

May 24, 2024
作者: Jialong Wu, Shaofeng Yin, Ningya Feng, Xu He, Dong Li, Jianye Hao, Mingsheng Long
cs.AI

Abstract

World models empower model-based agents to interactively explore, reason, and plan within imagined environments for real-world decision-making. However, the high demand for interactivity poses challenges in harnessing recent advancements in video generative models for developing world models at scale. This work introduces Interactive VideoGPT (iVideoGPT), a scalable autoregressive transformer framework that integrates multimodal signals--visual observations, actions, and rewards--into a sequence of tokens, facilitating an interactive experience of agents via next-token prediction. iVideoGPT features a novel compressive tokenization technique that efficiently discretizes high-dimensional visual observations. Leveraging its scalable architecture, we are able to pre-train iVideoGPT on millions of human and robotic manipulation trajectories, establishing a versatile foundation that is adaptable to serve as interactive world models for a wide range of downstream tasks. These include action-conditioned video prediction, visual planning, and model-based reinforcement learning, where iVideoGPT achieves competitive performance compared with state-of-the-art methods. Our work advances the development of interactive general world models, bridging the gap between generative video models and practical model-based reinforcement learning applications.
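The core idea of the abstract — flattening per-step multimodal signals (compressed observation tokens, actions, and rewards) into a single token sequence for next-token prediction — can be sketched as follows. This is a minimal illustration, not the paper's implementation; the function name, the fixed number of observation tokens per frame, and the assumption that actions and rewards are already discretized to single tokens are all assumptions for the sake of the example.

```python
# Hypothetical sketch of an iVideoGPT-style interleaved token sequence.
# Assumptions (not from the paper's code): each frame compresses to a
# fixed number of observation tokens, and each action/reward is already
# discretized to one token.

OBS_TOKENS_PER_FRAME = 4  # assumed size of the compressive tokenization

def interleave_trajectory(obs_tokens, action_tokens, reward_tokens):
    """Flatten per-step (observation, action, reward) tokens into one
    sequence suitable for autoregressive next-token prediction."""
    assert len(obs_tokens) == len(action_tokens) == len(reward_tokens)
    sequence = []
    for o, a, r in zip(obs_tokens, action_tokens, reward_tokens):
        assert len(o) == OBS_TOKENS_PER_FRAME
        sequence.extend(o)   # visual observation tokens for this step
        sequence.append(a)   # discretized action token
        sequence.append(r)   # discretized reward token
    return sequence

# Toy trajectory: 2 steps, 4 observation tokens per frame.
seq = interleave_trajectory(
    obs_tokens=[[1, 2, 3, 4], [5, 6, 7, 8]],
    action_tokens=[100, 101],
    reward_tokens=[200, 201],
)
print(seq)  # [1, 2, 3, 4, 100, 200, 5, 6, 7, 8, 101, 201]
```

A transformer trained with next-token prediction over such sequences can then act interactively: given the tokens so far plus a new action token, it rolls out the next observation and reward tokens, which is what lets the same model serve video prediction, planning, and model-based RL.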
