Self-Improving World Modelling with Latent Actions

Abstract

Internal modelling of the world, that is, predicting the transition from a previous state X to a next state Y under an action Z, is essential to reasoning and planning for LLMs and VLMs. Learning such models typically requires costly action-labelled trajectories. We propose SWIRL, a self-improvement framework that learns from state-only sequences by treating actions as a latent variable and alternating between Forward World Modelling (FWM), P_θ(Y|X,Z), and Inverse Dynamics Modelling (IDM), Q_φ(Z|X,Y). SWIRL iterates two phases: (1) Variational Information Maximisation, which updates the FWM to generate next states that maximise conditional mutual information with latent actions given prior states, encouraging identifiable consistency; and (2) ELBO Maximisation, which updates the IDM to explain observed transitions, effectively performing coordinate ascent. Both models are trained with reinforcement learning (specifically, GRPO), each using the log-probability assigned by the other, frozen model as its reward signal. We provide theoretical learnability guarantees for both updates, and evaluate SWIRL on LLMs and VLMs across multiple environments: single-turn and multi-turn open-world visual dynamics, and synthetic textual environments for physics, web, and tool calling. SWIRL achieves gains of 16% on AURORABench, 28% on ByteMorph, 16% on WorldPredictionBench, and 14% on StableToolBench.
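The reward structure described in the abstract can be illustrated with a toy sketch. It is not the authors' implementation: the lookup tables standing in for the frozen FWM and IDM, the probability floor, and all function names are hypothetical, and the GRPO policy update itself (clipping, KL penalty, gradient step) is omitted. The sketch only shows how each phase might score a group of samples with the other frozen model's log-probability and then standardise those scores within the group, as GRPO-style methods do.

```python
import math
from statistics import mean, pstdev

# Hypothetical frozen models on a toy problem, stored as lookup tables.
# In SWIRL these would be an LLM or VLM; the tables are illustrative only.
FROZEN_IDM = {  # Q(z | x, y)
    ("door closed", "door open"): {"push": 0.7, "pull": 0.2, "wait": 0.1},
    ("door closed", "door closed"): {"push": 0.1, "pull": 0.1, "wait": 0.8},
}
FROZEN_FWM = {  # P(y | x, z)
    ("door closed", "push"): {"door open": 0.9, "door closed": 0.1},
    ("door closed", "wait"): {"door open": 0.05, "door closed": 0.95},
}

FLOOR = 1e-6  # assumed probability floor for entries missing from a table


def log_prob(table, key, outcome):
    """Log-probability of `outcome` under the frozen table entry `key`."""
    return math.log(table.get(key, {}).get(outcome, FLOOR))


def fwm_rewards(x, z, sampled_next_states):
    """Phase 1: reward each next state y sampled by the FWM with log Q(z | x, y)."""
    return [log_prob(FROZEN_IDM, (x, y), z) for y in sampled_next_states]


def idm_rewards(x, y, sampled_actions):
    """Phase 2: reward each action z sampled by the IDM with log P(y | x, z)."""
    return [log_prob(FROZEN_FWM, (x, z), y) for z in sampled_actions]


def group_relative_advantages(rewards, eps=1e-8):
    """Standardise rewards within one sampled group (population std assumed)."""
    mu, sigma = mean(rewards), pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Phase 1: two candidate next states for the latent action "push".
phase1 = group_relative_advantages(
    fwm_rewards("door closed", "push", ["door open", "door closed"]))
# Phase 2: two candidate actions for the observed transition closed -> open.
phase2 = group_relative_advantages(
    idm_rewards("door closed", "door open", ["push", "wait"]))
```

In this toy setting the next state "door open" receives a positive advantage in phase 1 because the frozen IDM can recover the action "push" from it, and the action "push" receives a positive advantage in phase 2 because the frozen FWM assigns the observed transition a high probability under it.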
