Co-training Policy and World Modeling for Language Agents

Abstract

Reinforcement learning (RL) improves large language model (LLM) agents by teaching them which actions lead to high rewards, but provides little supervision on what those actions do to the environment. World modeling (WM) can fill this gap, yet existing approaches often require separate simulators, extra training stages, or additional inference-time computation. We observe that on-policy RL rollouts already contain the needed signal: each transition pairs an action with its resulting next observation. Based on this observation, we propose PaW, a Policy and World modeling co-training framework that adds auxiliary WM supervision to the same policy during RL, without changing the inference paradigm. To make auxiliary WM supervision informative and stable, PaW introduces three components: action-entropy-based WM data selection, noise-tolerant WM loss, and reward-adaptive loss balancing. Experiments on three agentic task benchmarks show consistent improvements over strong RL baselines across models and RL algorithms. These results suggest that standard RL rollouts are a practical source of WM supervision for language-agent training.
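The abstract gives no formulas, so the following is only a rough sketch of how rollout transitions might be reused as WM supervision. Everything in it is an assumption made for illustration, not PaW's actual implementation: the `Transition` fields, the rule of keeping the highest-entropy actions, the `keep_fraction` and `max_weight` parameters, and the linear reward-dependent weight are all hypothetical. The noise-tolerant WM loss is left out because the abstract does not describe its form.

```python
import math
from dataclasses import dataclass


@dataclass
class Transition:
    """One step of an on-policy rollout (hypothetical structure)."""
    observation: str
    action: str
    next_observation: str
    action_entropy: float  # assumed: mean token entropy of the sampled action


def select_wm_examples(transitions, keep_fraction=0.5):
    """Keep the transitions whose actions had the highest entropy.

    Assumption: higher action entropy marks a more informative WM example.
    The abstract only says that selection is based on action entropy.
    """
    if not transitions:
        return []
    k = math.ceil(len(transitions) * keep_fraction)
    ranked = sorted(transitions, key=lambda t: t.action_entropy, reverse=True)
    return ranked[:k]


def to_wm_pair(t):
    """Turn a transition into (context, target) text for next-observation prediction."""
    context = f"Observation: {t.observation}\nAction: {t.action}\nNext observation:"
    return context, t.next_observation


def combined_loss(rl_loss, wm_loss, mean_reward, max_weight=0.1):
    """Add the auxiliary WM loss with a weight that shrinks as reward grows.

    Assumption: rewards lie in [0, 1] and the weight decays linearly.
    This is a stand-in for the reward-adaptive balancing, not its real form.
    """
    r = min(max(mean_reward, 0.0), 1.0)
    weight = max_weight * (1.0 - r)
    return rl_loss + weight * wm_loss


if __name__ == "__main__":
    rollout = [
        Transition("door closed", "open door", "door open", 0.2),
        Transition("door open", "go north", "you are in a hall", 1.5),
    ]
    for t in select_wm_examples(rollout, keep_fraction=0.5):
        print(to_wm_pair(t))
    print(combined_loss(rl_loss=1.0, wm_loss=2.0, mean_reward=0.5))
```

In this sketch the same policy would be trained on both terms, and inference is unchanged because the WM pairs are used only as an extra training signal.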