
Self-Improving World Modelling with Latent Actions

February 5, 2026
Authors: Yifu Qiu, Zheng Zhao, Waylon Li, Yftah Ziser, Anna Korhonen, Shay B. Cohen, Edoardo M. Ponti
cs.AI

Abstract

Internal modelling of the world -- predicting transitions between previous states X and next states Y under actions Z -- is essential to reasoning and planning for LLMs and VLMs. Learning such models typically requires costly action-labelled trajectories. We propose SWIRL, a self-improvement framework that learns from state-only sequences by treating actions as a latent variable and alternating between Forward World Modelling (FWM) P_θ(Y|X,Z) and Inverse Dynamics Modelling (IDM) Q_φ(Z|X,Y). SWIRL iterates two phases: (1) Variational Information Maximisation, which updates the FWM to generate next states that maximise conditional mutual information with latent actions given prior states, encouraging identifiable consistency; and (2) ELBO Maximisation, which updates the IDM to explain observed transitions, effectively performing coordinate ascent. Both models are trained with reinforcement learning (specifically, GRPO), using the opposite frozen model's log-probability as a reward signal. We provide theoretical learnability guarantees for both updates, and evaluate SWIRL on LLMs and VLMs across multiple environments: single-turn and multi-turn open-world visual dynamics and synthetic textual environments for physics, web, and tool calling. SWIRL achieves gains of 16% on AURORABench, 28% on ByteMorph, 16% on WorldPredictionBench, and 14% on StableToolBench.
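The alternating two-phase scheme can be illustrated with a toy, fully tabular analogue. This is only a sketch under strong assumptions: the paper trains LLM/VLM policies with GRPO, rewarding each model with the frozen partner's log-probability, whereas here each phase is an exact tabular refit that plays the same coordinate-ascent role. The two-state environment, `true_next`, and all variable names are invented for illustration and are not from the paper.

```python
import random
from collections import defaultdict

# Toy tabular analogue of SWIRL's coordinate ascent on action-free data.
# FWM: P(y | x, z) as a table; IDM: Q(z | x, y) as a table.
STATES = [0, 1]
ACTIONS = ["inc", "dec"]

def true_next(x, z):
    """Hypothetical deterministic dynamics of a two-state environment."""
    return (x + 1) % 2 if z == "inc" else x

# Action-free trajectories: only (X, Y) state pairs are observed;
# the action that caused each transition is never recorded.
random.seed(0)
data = []
for _ in range(200):
    x = random.choice(STATES)
    data.append((x, true_next(x, random.choice(ACTIONS))))

# Initialise the FWM uniformly and the IDM with small random
# perturbations (needed to break the symmetry between latent actions).
P = {(x, z): {y: 0.5 for y in STATES} for x in STATES for z in ACTIONS}
Q = {}
for x in STATES:
    for y in STATES:
        p = random.uniform(0.3, 0.7)
        Q[(x, y)] = {"inc": p, "dec": 1 - p}

for _ in range(30):
    # Phase 1 analogue: refit the FWM so predicted next states are
    # consistent with the latent actions inferred by the frozen IDM.
    counts = defaultdict(lambda: defaultdict(float))
    for x, y in data:
        for z in ACTIONS:
            counts[(x, z)][y] += Q[(x, y)][z]
    for key, cy in counts.items():
        s = sum(cy.values())
        P[key] = {y: cy.get(y, 0.0) / s for y in STATES}

    # Phase 2 analogue (ELBO maximisation): update the IDM to explain
    # each observed transition, scoring latent actions by the frozen FWM.
    for (x, y) in set(data):
        w = {z: P[(x, z)][y] + 1e-9 for z in ACTIONS}
        s = sum(w.values())
        Q[(x, y)] = {z: w[z] / s for z in ACTIONS}

# Inspect the learned forward model for one (state, latent action) pair.
print({y: round(p, 2) for y, p in P[(0, "inc")].items()})
```

Because the actions are latent, the learned labels are only identifiable up to a permutation (swapping "inc" and "dec" leaves the objective unchanged); the paper's mutual-information term addresses exactly this kind of consistency.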