World Modeling with Probabilistic Structure Integration
September 10, 2025
Authors: Klemen Kotar, Wanhee Lee, Rahul Venkatesh, Honglin Chen, Daniel Bear, Jared Watrous, Simon Kim, Khai Loong Aw, Lilian Naing Chen, Stefan Stojanov, Kevin Feigelis, Imran Thobani, Alex Durango, Khaled Jedoui, Atlas Kazemian, Dan Yamins
cs.AI
Abstract
We present Probabilistic Structure Integration (PSI), a system for learning
richly controllable and flexibly promptable world models from data. PSI
consists of a three-step cycle. The first step, Probabilistic prediction,
involves building a probabilistic graphical model Psi of the data, in the form
of a random-access autoregressive sequence model. Psi supports a complete set
of learned conditional distributions describing the dependence of any variables
in the data on any other set of variables. In step 2, Structure extraction, we
show how to extract underlying low-dimensional properties in the data,
corresponding to a diverse set of meaningful "intermediate structures", in a
zero-shot fashion via causal inference on Psi. Step 3, Integration, completes
the cycle by converting these structures into new token types that are then
continually mixed back into the training diet as conditioning signals and
prediction targets. Each such cycle augments the capabilities of Psi, both
allowing it to model the underlying data better, and creating new control
handles -- akin to an LLM-like universal prompting language. We train an
instance of Psi on 1.4 trillion tokens of internet video data; we use it to
perform a variety of useful video prediction and understanding inferences; we
extract state-of-the-art optical flow, self-supervised depth and object
segmentation; and we use these structures to support a full cycle of predictive
improvements.
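
The three-step cycle can be illustrated with a toy sketch. Everything in it is an assumption made for illustration and is not taken from the paper: `toy_predictor` is a hypothetical fixed function standing in for the learned model Psi, frames are short 1-D lists rather than video, and "causal inference" is reduced to a single perturb-and-compare probe whose result is treated as a flow-like structure.

```python
from typing import Callable, Dict, List, Tuple

Frame = List[float]
Token = Tuple[str, int, float]  # (token type, position, value)


def toy_predictor(frame: Frame) -> Frame:
    """Hypothetical stand-in for Psi (step 1): the 'next frame' is the
    current frame cyclically shifted right by one position."""
    return frame[-1:] + frame[:-1]


def counterfactual_flow(predict: Callable[[Frame], Frame],
                        frame: Frame, pos: int, bump: float = 1.0) -> int:
    """Toy version of step 2: perturb one position, predict again, and
    report how far the perturbation moved in the predicted frame."""
    factual = predict(frame)
    perturbed = list(frame)
    perturbed[pos] += bump
    counterfactual = predict(perturbed)
    diffs = [abs(c - f) for c, f in zip(counterfactual, factual)]
    target = diffs.index(max(diffs))  # where the intervention shows up
    return target - pos


def integrate(frame: Frame, flows: Dict[int, int]) -> List[Token]:
    """Toy version of step 3: append the extracted structure to the
    sequence as a new token type alongside the original data tokens."""
    tokens: List[Token] = [("rgb", i, v) for i, v in enumerate(frame)]
    tokens += [("flow", i, float(d)) for i, d in flows.items()]
    return tokens


frame = [0.0, 1.0, 0.0, 0.0]
# Probe positions 0..2 only, so the cyclic wrap-around is not exercised.
flows = {pos: counterfactual_flow(toy_predictor, frame, pos) for pos in range(3)}
tokens = integrate(frame, flows)
```

In this sketch the mixed sequence `tokens` would serve as training data for the next round, where the new `flow` tokens can act as either conditioning signals or prediction targets. The actual system learns Psi from data and extracts structures from its conditional distributions, neither of which this toy attempts.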