Reinforcement World Model Learning for LLM-based Agents
February 5, 2026
Authors: Xiao Yu, Baolin Peng, Ruize Xu, Yelong Shen, Pengcheng He, Suman Nath, Nikhil Singh, Jianfeng Gao, Zhou Yu
cs.AI
Abstract
Large language models (LLMs) have achieved strong performance in language-centric tasks. However, in agentic settings, LLMs often struggle to anticipate action consequences and adapt to environment dynamics, highlighting the need for world-modeling capabilities in LLM-based agents. We propose Reinforcement World Model Learning (RWML), a self-supervised method that learns action-conditioned world models for LLM-based agents on textual states using sim-to-real gap rewards. Our method aligns simulated next states produced by the model with realized next states observed from the environment, encouraging consistency between internal world simulations and actual environment dynamics in a pre-trained embedding space. Unlike next-state token prediction, which prioritizes token-level fidelity (i.e., reproducing exact wording) over semantic equivalence and can lead to model collapse, our method provides a more robust training signal and is empirically less susceptible to reward hacking than LLM-as-a-judge methods. We evaluate our method on ALFWorld and τ^2 Bench and observe significant gains over the base model, even though training is entirely self-supervised. When combined with task-success rewards, our method outperforms RL trained directly on task-success rewards by 6.9 and 5.7 points on ALFWorld and τ^2 Bench, respectively, while matching the performance of training on expert data.
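The listing does not include code, but the core reward described in the abstract is easy to illustrate: score the model's simulated next state against the realized next state by similarity in a pre-trained embedding space. The sketch below is a minimal, assumed rendering of that idea; the sentence-transformers encoder ("all-MiniLM-L6-v2"), the function name sim_to_real_reward, and the cosine-similarity form are illustrative choices, not the paper's actual implementation.

```python
# Hypothetical sketch of a "sim-to-real gap" reward: cosine similarity between the
# agent's simulated next state and the environment's realized next state, computed
# in a pre-trained text-embedding space. Encoder choice and reward form are assumptions.
import numpy as np
from sentence_transformers import SentenceTransformer

_encoder = SentenceTransformer("all-MiniLM-L6-v2")  # any pre-trained text encoder

def sim_to_real_reward(simulated_next_state: str, realized_next_state: str) -> float:
    """Reward in [-1, 1]: higher when the simulated next state is semantically
    close to the next state actually observed from the environment."""
    sim_emb, real_emb = _encoder.encode([simulated_next_state, realized_next_state])
    return float(np.dot(sim_emb, real_emb) /
                 (np.linalg.norm(sim_emb) * np.linalg.norm(real_emb)))

# Example: the reward tracks semantic (not token-level) agreement.
r = sim_to_real_reward(
    "You open the drawer and see a key inside.",
    "The drawer is now open; a small key lies inside.",
)
print(f"sim-to-real reward: {r:.3f}")
```

Unlike next-state token prediction, such an embedding-space score would not penalize paraphrases of the realized next state, which is the robustness property the abstract emphasizes.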