Policy and World Modeling Co-Training for Language Agents

Abstract

Reinforcement learning (RL) improves large language model (LLM) agents by teaching them which actions lead to high rewards, but it provides little supervision on how those actions change the environment. World modeling (WM) can fill this gap, yet existing approaches often require separate simulators, extra training stages, or additional inference-time computation. We observe that on-policy RL rollouts already contain the needed signal: each transition pairs an action with the next observation it produces. Based on this observation, we propose PaW, a Policy and World modeling co-training framework that adds auxiliary WM supervision to the same policy during RL, without changing the inference paradigm. To make the auxiliary WM supervision informative and stable, PaW introduces three components: action-entropy-based WM data selection, a noise-tolerant WM loss, and reward-adaptive loss balancing. Experiments on three agentic task benchmarks show consistent improvements over strong RL baselines across models and RL algorithms. These results suggest that standard RL rollouts are a practical source of WM supervision for language-agent training.
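
To make the central observation concrete, the following minimal Python sketch shows how the transitions in a single rollout could be turned into WM training examples, each pairing an (observation, action) prompt with the next observation as the target. It is an illustration under stated assumptions, not PaW's implementation: the `Step` structure, the prompt template, the `action_entropy` field, and the rule of keeping the highest-entropy fraction of examples are all hypothetical, since the abstract does not specify how entropy-based selection, the noise-tolerant loss, or the loss balancing are defined.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Step:
    observation: str       # what the agent saw before acting
    action: str            # text action emitted by the policy
    action_entropy: float  # hypothetical: e.g. mean token entropy of the action


# (prompt, target next observation, action entropy)
WMExample = Tuple[str, str, float]


def wm_examples(trajectory: List[Step], final_observation: str) -> List[WMExample]:
    """Pair each (observation, action) with the observation that followed it."""
    examples = []
    for i, step in enumerate(trajectory):
        # The target is the next step's observation, or the final observation
        # returned by the environment after the last action.
        if i + 1 < len(trajectory):
            next_obs = trajectory[i + 1].observation
        else:
            next_obs = final_observation
        # Hypothetical prompt template, used for illustration only.
        prompt = (
            f"Observation: {step.observation}\n"
            f"Action: {step.action}\n"
            "Next observation:"
        )
        examples.append((prompt, next_obs, step.action_entropy))
    return examples


def select_by_entropy(examples: List[WMExample], keep_fraction: float = 0.5) -> List[WMExample]:
    """Assumed selection rule: keep the highest-entropy fraction of examples."""
    k = max(1, int(len(examples) * keep_fraction))
    return sorted(examples, key=lambda e: e[2], reverse=True)[:k]
```

In a co-training loop, the selected examples would feed an auxiliary next-observation prediction loss that is added to the RL objective for the same policy. How that loss is made noise-tolerant and how it is weighted against the reward signal are not described in the abstract and are left out of this sketch.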