World Modeling with Probabilistic Structure Integration
September 10, 2025
Authors: Klemen Kotar, Wanhee Lee, Rahul Venkatesh, Honglin Chen, Daniel Bear, Jared Watrous, Simon Kim, Khai Loong Aw, Lilian Naing Chen, Stefan Stojanov, Kevin Feigelis, Imran Thobani, Alex Durango, Khaled Jedoui, Atlas Kazemian, Dan Yamins
cs.AI
Abstract
We present Probabilistic Structure Integration (PSI), a system for learning
richly controllable and flexibly promptable world models from data. PSI
consists of a three-step cycle. The first step, Probabilistic prediction,
involves building a probabilistic graphical model Psi of the data, in the form
of a random-access autoregressive sequence model. Psi supports a complete set
of learned conditional distributions describing the dependence of any variables
in the data on any other set of variables. In Step 2, Structure extraction, we
show how to extract underlying low-dimensional properties in the data,
corresponding to a diverse set of meaningful "intermediate structures", in a
zero-shot fashion via causal inference on Psi. Step 3, Integration, completes
the cycle by converting these structures into new token types that are then
continually mixed back into the training diet as conditioning signals and
prediction targets. Each such cycle augments the capabilities of Psi, both
allowing it to model the underlying data better, and creating new control
handles -- akin to an LLM-like universal prompting language. We train an
instance of Psi on 1.4 trillion tokens of internet video data; we use it to
perform a variety of useful video prediction and understanding inferences; we
extract state-of-the-art optical flow, self-supervised depth, and object
segmentation; and we use these structures to support a full cycle of predictive
improvements.
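To make the "complete set of learned conditional distributions" concrete, here is a minimal toy sketch (not the paper's implementation): a small joint distribution over three binary variables stands in for Psi, and a single query routine returns the conditional of any target variable given any subset of the others, mirroring the random-access property described in the abstract. The variable names and the table-based model are illustrative assumptions only.

```python
import numpy as np

# Toy stand-in for Psi (illustrative assumption, not the paper's model):
# a joint distribution P(x0, x1, x2) over three binary variables.
rng = np.random.default_rng(0)
joint = rng.random((2, 2, 2))
joint /= joint.sum()  # normalize to a valid joint distribution

def conditional(joint, target, given):
    """Return P(x_target | x_i = v_i for (i, v_i) in given.items()).

    Demonstrates random-access conditioning: any variable can be the
    prediction target, conditioned on any subset of the other variables.
    """
    axes = range(joint.ndim)
    # Slice the observed variables down to their given values.
    sub = joint[tuple(given.get(ax, slice(None)) for ax in axes)]
    # Marginalize out every remaining variable except the target.
    remaining = [ax for ax in axes if ax not in given]
    keep = remaining.index(target)
    other = tuple(i for i in range(sub.ndim) if i != keep)
    marg = sub.sum(axis=other) if other else sub
    return marg / marg.sum()

# Any variable against any evidence set:
p = conditional(joint, target=2, given={0: 1})        # P(x2 | x0 = 1)
q = conditional(joint, target=0, given={1: 0, 2: 1})  # P(x0 | x1 = 0, x2 = 1)
```

In the paper's setting the joint table is replaced by a learned autoregressive sequence model over video tokens, but the query pattern is the same: pick a target, pick an evidence set, read off the conditional.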