Self-Improving World Modelling with Latent Actions
February 5, 2026
Authors: Yifu Qiu, Zheng Zhao, Waylon Li, Yftah Ziser, Anna Korhonen, Shay B. Cohen, Edoardo M. Ponti
cs.AI
Abstract
Internal modelling of the world -- predicting transitions from previous states X to next states Y under actions Z -- is essential to reasoning and planning for LLMs and VLMs. Learning such models typically requires costly action-labelled trajectories. We propose SWIRL, a self-improvement framework that learns from state-only sequences by treating actions as latent variables and alternating between Forward World Modelling (FWM), P_θ(Y|X,Z), and Inverse Dynamics Modelling (IDM), Q_φ(Z|X,Y). SWIRL iterates two phases: (1) Variational Information Maximisation, which updates the FWM to generate next states that maximise conditional mutual information with latent actions given prior states, encouraging identifiable consistency; and (2) ELBO Maximisation, which updates the IDM to explain observed transitions, effectively performing coordinate ascent. Both models are trained with reinforcement learning (specifically, GRPO), using the frozen counterpart's log-probability as the reward signal. We provide theoretical learnability guarantees for both updates, and evaluate SWIRL on LLMs and VLMs across multiple environments: single-turn and multi-turn open-world visual dynamics, and synthetic textual environments for physics, web, and tool calling. SWIRL achieves gains of 16% on AURORABench, 28% on ByteMorph, 16% on WorldPredictionBench, and 14% on StableToolBench.
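To make the alternating scheme concrete, the following is a minimal toy sketch of SWIRL's coordinate-ascent loop. All names here are hypothetical: the tabular `Categorical` models and the REINFORCE update stand in for the paper's LLM/VLM policies and GRPO; only the *structure* (phase 1 updates the FWM with the frozen IDM's log-probability as reward, phase 2 updates the IDM with the frozen FWM's log-probability as reward) mirrors the abstract.

```python
# Toy sketch of SWIRL's two-phase coordinate ascent (hypothetical names;
# GRPO on LLMs/VLMs is replaced by a REINFORCE update on tabular models).
import math
import random

random.seed(0)

STATES = ["s0", "s1"]
ACTIONS = ["a0", "a1"]

def softmax(logits):
    m = max(logits)
    exps = [math.exp(l - m) for l in logits]
    z = sum(exps)
    return [e / z for e in exps]

class Categorical:
    """Tabular conditional distribution with per-context logits."""
    def __init__(self, contexts, outcomes):
        self.outcomes = outcomes
        self.logits = {c: [0.0] * len(outcomes) for c in contexts}

    def probs(self, ctx):
        return softmax(self.logits[ctx])

    def sample(self, ctx):
        r, acc = random.random(), 0.0
        for o, p in zip(self.outcomes, self.probs(ctx)):
            acc += p
            if r < acc:
                return o
        return self.outcomes[-1]

    def logp(self, ctx, outcome):
        return math.log(self.probs(ctx)[self.outcomes.index(outcome)])

    def reinforce(self, ctx, outcome, reward, lr=0.5):
        # REINFORCE step: grad of log p(outcome) w.r.t. logit_i is
        # (indicator_i - p_i); scale by the reward, standing in for GRPO.
        p = self.probs(ctx)
        for i, o in enumerate(self.outcomes):
            grad = (1.0 if o == outcome else 0.0) - p[i]
            self.logits[ctx][i] += lr * reward * grad

# FWM P_theta(Y|X,Z) and IDM Q_phi(Z|X,Y) as tabular models.
fwm = Categorical([(x, z) for x in STATES for z in ACTIONS], STATES)
idm = Categorical([(x, y) for x in STATES for y in STATES], ACTIONS)

# State-only corpus: observed (X, Y) transitions with no action labels.
corpus = [("s0", "s1"), ("s1", "s0")]
BASELINE = math.log(0.5)  # uniform-distribution baseline to centre rewards

for _ in range(200):
    x, y = random.choice(corpus)
    # Phase 1: update FWM; reward is the frozen IDM's log-prob of
    # recovering the latent action from the generated next state.
    z = random.choice(ACTIONS)
    y_gen = fwm.sample((x, z))
    fwm.reinforce((x, z), y_gen, reward=idm.logp((x, y_gen), z) - BASELINE)
    # Phase 2: update IDM; reward is the frozen FWM's log-prob of the
    # observed next state under the inferred action (ELBO-style).
    z_hat = idm.sample((x, y))
    idm.reinforce((x, y), z_hat, reward=fwm.logp((x, z_hat), y) - BASELINE)
```

Phase 1 pushes the FWM toward next states from which the action is recoverable (an information-maximisation pressure), while phase 2 fits the IDM to the observed transitions; each phase treats the other model as a frozen reward function, which is the coordinate-ascent pattern the abstract describes.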