Pandora: Towards General World Model with Natural Language Actions and Video States
June 12, 2024
Authors: Jiannan Xiang, Guangyi Liu, Yi Gu, Qiyue Gao, Yuting Ning, Yuheng Zha, Zeyu Feng, Tianhua Tao, Shibo Hao, Yemin Shi, Zhengzhong Liu, Eric P. Xing, Zhiting Hu
cs.AI
Abstract
World models simulate future states of the world in response to different
actions. They facilitate interactive content creation and provide a foundation
for grounded, long-horizon reasoning. Current foundation models do not yet
match the capabilities of general world models: large language models (LLMs) are
constrained by their reliance on the language modality and their limited
understanding of the physical world, while video models lack interactive action
control over world simulations. This paper takes a step toward building a
general world model by introducing Pandora, a hybrid autoregressive-diffusion
model that simulates world states by generating videos and allows real-time
control with free-text actions. Pandora achieves domain generality, video
consistency, and controllability through large-scale pretraining and
instruction tuning. Crucially, Pandora bypasses the cost of
training-from-scratch by integrating a pretrained LLM (7B) and a pretrained
video model, requiring only additional lightweight finetuning. We showcase
extensive outputs from Pandora across diverse domains (indoor/outdoor,
natural/urban, human/robot, 2D/3D, etc.). The results indicate the great potential
of building stronger general world models with larger-scale training.
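
The abstract describes Pandora as a hybrid autoregressive-diffusion model in which a pretrained LLM encodes free-text actions, a pretrained video model produces the next world state as video, and only lightweight additional finetuning is performed. The sketch below is a minimal, illustrative rendering of that high-level design; the class name `PandoraSketch`, the adapter placement, and all dimensions are assumptions made for exposition, not the authors' implementation.

```python
import torch
import torch.nn as nn


class PandoraSketch(nn.Module):
    """Conceptual sketch of a hybrid autoregressive-diffusion world model.

    Assumptions (not from the paper): a frozen text backbone stands in for
    the pretrained 7B LLM, a frozen transformer stands in for the pretrained
    video diffusion backbone, and a small linear adapter is the only newly
    trained component, mirroring the "lightweight finetuning" claim.
    """

    def __init__(self, llm_dim: int = 512, video_dim: int = 256):
        super().__init__()
        # Stand-in for the pretrained autoregressive LLM (7B in the paper).
        self.llm = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model=llm_dim, nhead=8, batch_first=True),
            num_layers=2,
        )
        # Hypothetical lightweight adapter bridging text and video features.
        self.adapter = nn.Linear(llm_dim, video_dim)
        # Stand-in for the pretrained video model's denoiser.
        self.video_denoiser = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model=video_dim, nhead=8, batch_first=True),
            num_layers=2,
        )

    def forward(self, action_embeddings: torch.Tensor,
                noisy_video_latents: torch.Tensor) -> torch.Tensor:
        # 1) Autoregressive side: encode the free-text action.
        action_states = self.llm(action_embeddings)        # (B, T_text, llm_dim)
        cond = self.adapter(action_states.mean(dim=1))     # (B, video_dim)
        # 2) Diffusion side: denoise video latents conditioned on the action.
        conditioned = noisy_video_latents + cond.unsqueeze(1)
        return self.video_denoiser(conditioned)            # predicted clean latents


if __name__ == "__main__":
    model = PandoraSketch()
    text = torch.randn(1, 16, 512)      # placeholder action-token embeddings
    latents = torch.randn(1, 32, 256)   # placeholder noisy video latents
    print(model(text, latents).shape)   # torch.Size([1, 32, 256])
```

In this reading, the pretrained backbones are reused as-is and only the adapter would be trained, which is one plausible way the cost of training from scratch could be avoided; the abstract does not specify where the adaptation layers actually sit.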