

PAN: A World Model for General, Interactable, and Long-Horizon World Simulation

November 12, 2025
Authors: PAN Team, Jiannan Xiang, Yi Gu, Zihan Liu, Zeyu Feng, Qiyue Gao, Yiyan Hu, Benhao Huang, Guangyi Liu, Yichi Yang, Kun Zhou, Davit Abrahamyan, Arif Ahmad, Ganesh Bannur, Junrong Chen, Kimi Chen, Mingkai Deng, Ruobing Han, Xinqi Huang, Haoqiang Kang, Zheqi Li, Enze Ma, Hector Ren, Yashowardhan Shinde, Rohan Shingre, Ramsundar Tanikella, Kaiming Tao, Dequan Yang, Xinle Yu, Cong Zeng, Binglin Zhou, Zhengzhong Liu, Zhiting Hu, Eric P. Xing
cs.AI

Abstract

A world model enables an intelligent agent to imagine, predict, and reason about how the world evolves in response to its actions, and accordingly to plan and strategize. While recent video generation models produce realistic visual sequences, they typically operate in a prompt-to-full-video manner, without the causal control, interactivity, or long-horizon consistency required for purposeful reasoning. Existing world modeling efforts, on the other hand, often focus on restricted domains (e.g., physical, game, or 3D-scene dynamics) with limited depth and controllability, and struggle to generalize across diverse environments and interaction formats. In this work, we introduce PAN, a general, interactable, and long-horizon world model that predicts future world states through high-quality video simulation conditioned on history and natural language actions. PAN employs the Generative Latent Prediction (GLP) architecture, which combines an autoregressive latent dynamics backbone based on a large language model (LLM), grounding simulation in extensive text-based knowledge and enabling conditioning on language-specified actions, with a video diffusion decoder that reconstructs perceptually detailed and temporally coherent visual observations. This design unifies latent-space reasoning (imagination) with realizable world dynamics (reality). Trained on large-scale video-action pairs spanning diverse domains, PAN supports open-domain, action-conditioned simulation with coherent, long-term dynamics. Extensive experiments show that PAN achieves strong performance in action-conditioned world simulation, long-horizon forecasting, and simulative reasoning compared to other video generators and world models, taking a step towards general world models that enable predictive simulation of future world states for reasoning and acting.
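The GLP loop described above can be sketched in code: an autoregressive latent backbone rolls the world state forward conditioned on a natural-language action, and a decoder renders each latent state into an observation. This is a minimal illustrative sketch, not the authors' implementation; all class names, the toy action embedding, and the dimensions are hypothetical stand-ins for the LLM backbone and the video diffusion decoder.

```python
# Hypothetical sketch of an action-conditioned latent-prediction rollout in the
# spirit of PAN's GLP architecture. Names and dimensions are illustrative only.
import numpy as np

rng = np.random.default_rng(0)
LATENT_DIM, OBS_DIM = 8, 16


class LatentBackbone:
    """Stand-in for the LLM-based autoregressive latent dynamics model."""

    def __init__(self):
        self.W_state = rng.normal(size=(LATENT_DIM, LATENT_DIM)) * 0.1
        self.W_action = rng.normal(size=(LATENT_DIM, LATENT_DIM)) * 0.1

    def embed_action(self, action: str) -> np.ndarray:
        # Toy embedding of a natural-language action (a real system would
        # use the LLM's token embeddings).
        h = abs(hash(action)) % 97 + 1
        return np.sin(np.arange(LATENT_DIM) * h)

    def step(self, state: np.ndarray, action: str) -> np.ndarray:
        # Next latent state from the current state and the language action.
        return np.tanh(self.W_state @ state + self.W_action @ self.embed_action(action))


class VideoDecoder:
    """Stand-in for the video diffusion decoder: latent state -> observation."""

    def __init__(self):
        self.W_dec = rng.normal(size=(OBS_DIM, LATENT_DIM)) * 0.1

    def decode(self, state: np.ndarray) -> np.ndarray:
        return self.W_dec @ state


def simulate(actions, steps_per_action=1):
    """Roll out an action-conditioned simulation; returns decoded observations."""
    backbone, decoder = LatentBackbone(), VideoDecoder()
    state = np.zeros(LATENT_DIM)
    observations = []
    for action in actions:
        for _ in range(steps_per_action):
            state = backbone.step(state, action)        # imagination in latent space
            observations.append(decoder.decode(state))  # rendered observation
    return np.stack(observations)


obs = simulate(["pick up the red cube", "place it on the shelf"], steps_per_action=3)
print(obs.shape)  # (6, 16): six decoded observation "frames"
```

The key structural point the sketch captures is that reasoning happens in the compact latent space, while the decoder is only invoked to realize each predicted state as a perceptual observation.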