Self-Improving World Modelling with Latent Actions

Abstract

Internal modelling of the world, that is, predicting the transition from a previous state X to a next state Y under an action Z, is essential to reasoning and planning for LLMs and VLMs. Learning such models typically requires costly action-labelled trajectories. We propose SWIRL, a self-improvement framework that learns from state-only sequences by treating actions as a latent variable and alternating between Forward World Modelling (FWM), P_θ(Y|X,Z), and Inverse Dynamics Modelling (IDM), Q_φ(Z|X,Y). SWIRL iterates two phases: (1) Variational Information Maximisation, which updates the FWM to generate next states that maximise conditional mutual information with latent actions given prior states, encouraging identifiable consistency; and (2) ELBO Maximisation, which updates the IDM to explain observed transitions, effectively performing coordinate ascent. Both models are trained with reinforcement learning (specifically, GRPO), using the log-probability of the frozen opposite model as the reward signal. We provide theoretical learnability guarantees for both updates, and evaluate SWIRL on LLMs and VLMs across multiple environments: single-turn and multi-turn open-world visual dynamics, and synthetic textual environments for physics, web, and tool calling. SWIRL achieves gains of 16% on AURORABench, 28% on ByteMorph, 16% on WorldPredictionBench, and 14% on StableToolBench.
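The reward wiring described above can be illustrated with a minimal sketch. It assumes only what the abstract states: each phase samples a group of outputs from the model being trained, scores them with the frozen opposite model's log-probability, and standardises the scores within the group, as GRPO does. All function names, the `logprob(target, x, other)` signature, and the toy table are hypothetical and not taken from the paper. The policy-gradient step itself is omitted.

```python
import math
from statistics import mean, pstdev
from typing import Callable, List, Sequence

# Signature assumed for this sketch only: logprob(target, x, other) -> float
LogProb = Callable[[str, str, str], float]


def group_relative_advantages(rewards: Sequence[float], eps: float = 1e-8) -> List[float]:
    """Standardise rewards within one sampled group (GRPO-style advantage)."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


def fwm_phase_rewards(x: str, z: str, sampled_next_states: Sequence[str],
                      frozen_idm_logprob: LogProb) -> List[float]:
    """Phase 1: score each FWM sample y by the frozen IDM's log Q_phi(z | x, y)."""
    return [frozen_idm_logprob(z, x, y) for y in sampled_next_states]


def idm_phase_rewards(x: str, y: str, sampled_actions: Sequence[str],
                      frozen_fwm_logprob: LogProb) -> List[float]:
    """Phase 2: score each IDM sample z by the frozen FWM's log P_theta(y | x, z)."""
    return [frozen_fwm_logprob(y, x, z) for z in sampled_actions]


# Hypothetical tabular IDM, Q(z | x, y), standing in for a frozen LLM or VLM.
TOY_IDM = {
    ("s0", "s1"): {"right": 0.9, "left": 0.1},
    ("s0", "s2"): {"right": 0.2, "left": 0.8},
}


def toy_idm_logprob(z: str, x: str, y: str) -> float:
    return math.log(TOY_IDM[(x, y)][z])


# Two candidate next states sampled by the FWM for state "s0" and action "right".
rewards = fwm_phase_rewards("s0", "right", ["s1", "s2"], toy_idm_logprob)
advantages = group_relative_advantages(rewards)
```

In this toy group, the candidate `s1`, from which the frozen IDM can readily recover the action `right`, receives a positive advantage, and `s2` receives a negative one. This is the sense in which the FWM is pushed towards next states that are informative about the latent action.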
