

Reinforcement World Model Learning for LLM-based Agents

February 5, 2026
Authors: Xiao Yu, Baolin Peng, Ruize Xu, Yelong Shen, Pengcheng He, Suman Nath, Nikhil Singh, Jianfeng Gao, Zhou Yu
cs.AI

Abstract

Large language models (LLMs) have achieved strong performance in language-centric tasks. However, in agentic settings, LLMs often struggle to anticipate action consequences and adapt to environment dynamics, highlighting the need for world-modeling capabilities in LLM-based agents. We propose Reinforcement World Model Learning (RWML), a self-supervised method that learns action-conditioned world models for LLM-based agents on textual states using sim-to-real gap rewards. Our method aligns simulated next states produced by the model with realized next states observed from the environment, encouraging consistency between internal world simulations and actual environment dynamics in a pre-trained embedding space. Unlike next-state token prediction, which prioritizes token-level fidelity (i.e., reproducing exact wording) over semantic equivalence and can lead to model collapse, our method provides a more robust training signal and is empirically less susceptible to reward hacking than LLM-as-a-judge. We evaluate our method on ALFWorld and τ^2 Bench and observe significant gains over the base model, even though the method is entirely self-supervised. When combined with task-success rewards, our method outperforms direct task-success reward RL by 6.9 and 5.7 points on ALFWorld and τ^2 Bench respectively, while matching the performance of expert-data training.
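To make the sim-to-real gap reward concrete, below is a minimal, illustrative sketch of the core idea as described in the abstract: score the agent's simulated next state against the environment's realized next state by similarity in a frozen, pre-trained text-embedding space, so that paraphrases of the true next state are rewarded even when the exact wording differs. The function and model names (`sim_to_real_reward`, the `all-MiniLM-L6-v2` encoder) are assumptions for illustration and are not taken from the paper.

```python
# Illustrative sketch (not the paper's implementation): a sim-to-real gap
# reward computed as cosine similarity between the simulated next state and
# the realized next state in a frozen, pre-trained embedding space.
from sentence_transformers import SentenceTransformer
import numpy as np

# Any frozen text encoder could serve here; this model choice is an assumption.
_encoder = SentenceTransformer("all-MiniLM-L6-v2")

def sim_to_real_reward(simulated_next_state: str, realized_next_state: str) -> float:
    """Cosine similarity between simulated and realized next-state texts.

    Higher similarity means a smaller sim-to-real gap and hence a larger
    reward. Because the comparison happens in embedding space, semantically
    equivalent rewordings score highly, unlike token-level next-state
    prediction, which penalizes any deviation from the exact wording.
    """
    sim_vec, real_vec = _encoder.encode([simulated_next_state, realized_next_state])
    return float(
        np.dot(sim_vec, real_vec)
        / (np.linalg.norm(sim_vec) * np.linalg.norm(real_vec))
    )

# Example: a reworded but semantically equivalent prediction still earns a high reward.
r = sim_to_real_reward(
    "You open the drawer and see a spoon and a clean mug.",
    "The drawer is now open. Inside it you find a mug and a spoon.",
)
print(f"sim-to-real reward: {r:.3f}")
```

In a full RWML-style training loop, a scalar like this would act as the self-supervised reward for the model's generated next-state simulations, optionally combined with task-success rewards as the abstract describes.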