Co-Training Policy and World Modeling for Language Agents

Abstract

Reinforcement learning (RL) improves large language model (LLM) agents by teaching them which actions lead to high rewards, but provides little supervision on what those actions do to the environment. World modeling (WM) can fill this gap, yet existing approaches often require separate simulators, extra training stages, or additional inference-time computation. We observe that on-policy RL rollouts already contain the needed signal: each transition pairs an action with its resulting next observation. Based on this observation, we propose PaW, a Policy and World modeling co-training framework that adds auxiliary WM supervision to the same policy during RL, without changing the inference paradigm. To make auxiliary WM supervision informative and stable, PaW introduces three components: action-entropy-based WM data selection, noise-tolerant WM loss, and reward-adaptive loss balancing. Experiments on three agentic task benchmarks show consistent improvements over strong RL baselines across models and RL algorithms. These results suggest that standard RL rollouts are a practical source of WM supervision for language-agent training.
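
The observation that each rollout transition already pairs an action with its resulting next observation can be illustrated with a minimal Python sketch. It is not PaW's implementation: the `Step` record, the `action_entropy` field, the prompt template, the `min_entropy` threshold (including the choice to keep higher-entropy actions), and the fixed `wm_weight` are all assumptions made for illustration. The abstract names entropy-based selection, a noise-tolerant loss, and reward-adaptive balancing but does not specify them, so the sketch stands in a simple threshold and a constant weight.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Step:
    observation: str       # what the agent saw before acting
    action: str            # the action text the policy emitted
    action_entropy: float  # hypothetical per-action entropy estimate


def wm_examples(steps: List[Step], final_observation: str,
                min_entropy: float = 0.0) -> List[Tuple[str, str]]:
    """Turn one rollout into (prompt, target) pairs for next-observation prediction."""
    # The observation that follows step i is the one seen at step i + 1;
    # the last step is followed by the final observation of the episode.
    next_obs = [s.observation for s in steps[1:]] + [final_observation]
    examples = []
    for step, target in zip(steps, next_obs):
        if step.action_entropy < min_entropy:
            continue  # assumed selection rule, not taken from the source
        prompt = (f"Observation: {step.observation}\n"
                  f"Action: {step.action}\n"
                  f"Next observation:")
        examples.append((prompt, target))
    return examples


def joint_loss(rl_loss: float, wm_loss: float, wm_weight: float) -> float:
    """Add the auxiliary WM loss to the RL loss with a fixed (assumed) weight."""
    return rl_loss + wm_weight * wm_loss


# Usage: a two-step rollout in a toy text environment.
rollout = [Step("room A", "go north", 1.2), Step("room B", "open door", 0.1)]
pairs = wm_examples(rollout, "door is open", min_entropy=0.5)
```

Because the pairs come from the same rollouts used for the RL update, a sketch like this needs no separate simulator or extra data collection, and the policy is used at inference time exactly as before.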