

PAN: A World Model for General, Interactable, and Long-Horizon World Simulation

November 12, 2025
Authors: PAN Team, Jiannan Xiang, Yi Gu, Zihan Liu, Zeyu Feng, Qiyue Gao, Yiyan Hu, Benhao Huang, Guangyi Liu, Yichi Yang, Kun Zhou, Davit Abrahamyan, Arif Ahmad, Ganesh Bannur, Junrong Chen, Kimi Chen, Mingkai Deng, Ruobing Han, Xinqi Huang, Haoqiang Kang, Zheqi Li, Enze Ma, Hector Ren, Yashowardhan Shinde, Rohan Shingre, Ramsundar Tanikella, Kaiming Tao, Dequan Yang, Xinle Yu, Cong Zeng, Binglin Zhou, Zhengzhong Liu, Zhiting Hu, Eric P. Xing
cs.AI

Abstract

A world model enables an intelligent agent to imagine, predict, and reason about how the world evolves in response to its actions, and accordingly to plan and strategize. While recent video generation models produce realistic visual sequences, they typically operate in a prompt-to-full-video manner, without the causal control, interactivity, or long-horizon consistency required for purposeful reasoning. Existing world modeling efforts, on the other hand, often focus on restricted domains (e.g., physical, game, or 3D-scene dynamics) with limited depth and controllability, and struggle to generalize across diverse environments and interaction formats. In this work, we introduce PAN, a general, interactable, and long-horizon world model that predicts future world states through high-quality video simulation conditioned on history and natural-language actions. PAN employs the Generative Latent Prediction (GLP) architecture, which combines an autoregressive latent dynamics backbone based on a large language model (LLM), grounding simulation in extensive text-based knowledge and enabling conditioning on language-specified actions, with a video diffusion decoder that reconstructs perceptually detailed and temporally coherent visual observations. This unifies latent-space reasoning (imagination) with realizable world dynamics (reality). Trained on large-scale video-action pairs spanning diverse domains, PAN supports open-domain, action-conditioned simulation with coherent, long-term dynamics. Extensive experiments show that PAN outperforms other video generators and world models in action-conditioned world simulation, long-horizon forecasting, and simulative reasoning, taking a step towards general world models that enable predictive simulation of future world states for reasoning and acting.
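The abstract's GLP loop — an autoregressive backbone predicting the next latent world state from history and a language action, with a decoder rendering each latent into an observation — can be sketched minimally. This is an illustrative toy only: the linear maps standing in for the LLM backbone and the diffusion decoder, and all dimensions and function names, are assumptions, not PAN's actual components.

```python
import numpy as np

rng = np.random.default_rng(0)
LATENT_DIM, ACTION_DIM, FRAME_DIM = 8, 4, 16

# Hypothetical stand-ins: a real system would use an LLM backbone and a
# video diffusion decoder here, not random linear maps.
W_dyn = rng.normal(size=(LATENT_DIM, LATENT_DIM)) * 0.1  # latent dynamics
W_act = rng.normal(size=(ACTION_DIM, LATENT_DIM)) * 0.1  # action conditioning
W_dec = rng.normal(size=(LATENT_DIM, FRAME_DIM))         # latent -> frame

def predict_next_latent(z, action_emb):
    """Autoregressive step: next latent from current latent + language action."""
    return np.tanh(z @ W_dyn + action_emb @ W_act)

def decode_frame(z):
    """Render a latent world state into a (flattened) visual observation."""
    return z @ W_dec

def simulate(z0, action_embs):
    """Roll out a long-horizon, action-conditioned simulation step by step."""
    z, frames = z0, []
    for a in action_embs:
        z = predict_next_latent(z, a)   # imagine the next world state
        frames.append(decode_frame(z))  # realize it as an observation
    return np.stack(frames)

z0 = rng.normal(size=LATENT_DIM)            # initial world state from history
actions = rng.normal(size=(5, ACTION_DIM))  # e.g. embeddings of 5 language actions
video = simulate(z0, actions)
print(video.shape)  # (5, 16): one decoded frame per action step
```

The key structural point the sketch captures is that reasoning happens in latent space (the `z` rollout) while each step is also decodable into an observation, so imagination stays tied to realizable dynamics.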