Pandora: Towards General World Model with Natural Language Actions and Video States

Abstract

World models simulate future states of the world in response to different actions. They facilitate interactive content creation and provide a foundation for grounded, long-horizon reasoning. Current foundation models do not fully provide the capabilities required of a general world model: large language models (LLMs) are constrained by their reliance on the language modality and their limited understanding of the physical world, while video models lack interactive action control over the simulated world. This paper takes a step towards building a general world model by introducing Pandora, a hybrid autoregressive-diffusion model that simulates world states by generating videos and allows real-time control with free-text actions. Pandora achieves domain generality, video consistency, and controllability through large-scale pretraining and instruction tuning. Crucially, Pandora bypasses the cost of training from scratch by integrating a pretrained LLM (7B) and a pretrained video model, requiring only additional lightweight finetuning. We illustrate extensive outputs by Pandora across diverse domains (indoor/outdoor, natural/urban, human/robot, 2D/3D, etc.). The results indicate great potential for building stronger general world models with larger-scale training.
