Pandora: Towards General World Model with Natural Language Actions and Video States
June 12, 2024
Authors: Jiannan Xiang, Guangyi Liu, Yi Gu, Qiyue Gao, Yuting Ning, Yuheng Zha, Zeyu Feng, Tianhua Tao, Shibo Hao, Yemin Shi, Zhengzhong Liu, Eric P. Xing, Zhiting Hu
cs.AI
Abstract
World models simulate future states of the world in response to different
actions. They facilitate interactive content creation and provide a foundation
for grounded, long-horizon reasoning. Current foundation models do not fully
meet the capabilities of general world models: large language models (LLMs) are
constrained by their reliance on language modality and their limited
understanding of the physical world, while video models lack interactive action
control over the world simulations. This paper takes a step towards building a
general world model by introducing Pandora, a hybrid autoregressive-diffusion
model that simulates world states by generating videos and allows real-time
control with free-text actions. Pandora achieves domain generality, video
consistency, and controllability through large-scale pretraining and
instruction tuning. Crucially, Pandora bypasses the cost of
training-from-scratch by integrating a pretrained LLM (7B) and a pretrained
video model, requiring only additional lightweight finetuning. We illustrate
extensive outputs by Pandora across diverse domains (indoor/outdoor,
natural/urban, human/robot, 2D/3D, etc.). The results indicate great potential
for building stronger general world models with larger-scale training.