Bridging the Gap Between Agents and the World: Text World Models for LLM-Based Agents

Abstract

Large language model (LLM)-based agents are increasingly used in interactive textual environments, from web navigation and code editing to tool use and long-horizon dialogue. Yet many remain largely reactive, mapping observations to actions without an explicit model of how these environments are structured and evolve. This motivates text world models (TWMs): transition models over textual states that, given a state and a candidate action, predict the resulting webpage, terminal output, API response, or user reply, thereby supporting planning, efficient learning, and principled evaluation. We systematically review text world models for LLM-based agents, organized around a formal framework and the agent lifecycle: (1) Foundations, defining text world models and characterizing them by state representation and grounding domain; (2) Construction, taxonomizing the LLM-as-world-model (LLM-as-WM) and code-as-world-model (code-as-WM) paradigms and reviewing methods for building them; (3) Application, examining how world models support agents at training time through experience synthesis and at inference time through planning, verification, and adaptation; and (4) Evaluation, covering both evaluation of the world model itself and its use as an evaluation environment for agents. We aim to consolidate this rapidly developing area, clarify its design space, and highlight open challenges for future research.
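
To make the transition-model definition concrete, the following is a minimal illustrative sketch, not a method taken from the survey. It assumes a toy setting in which the textual state is a space-separated file listing and actions are shell-like strings. The names `TextWorldModel`, `ToyFileWorldModel`, and `plan_one_step` are hypothetical and introduced only for illustration. The rule-based class loosely mirrors the code-as-WM idea (the transition is written as a program), and the helper shows inference-time use of a world model for one-step lookahead planning.

```python
from typing import Callable, Iterable, Optional, Protocol


class TextWorldModel(Protocol):
    """Hypothetical interface: a transition model over textual states."""

    def predict(self, state: str, action: str) -> str:
        """Return the predicted next textual state for (state, action)."""
        ...


class ToyFileWorldModel:
    """Hypothetical rule-based world model of a tiny file listing.

    The state is a space-separated list of file names. Only the
    actions 'touch NAME' and 'rm NAME' change the predicted state.
    """

    def predict(self, state: str, action: str) -> str:
        files = set(state.split())
        parts = action.split()
        if len(parts) == 2 and parts[0] == "touch":
            files.add(parts[1])
        elif len(parts) == 2 and parts[0] == "rm":
            files.discard(parts[1])
        # Any other action is predicted to leave the state unchanged.
        return " ".join(sorted(files))


def plan_one_step(
    model: TextWorldModel,
    state: str,
    candidates: Iterable[str],
    goal: Callable[[str], bool],
) -> Optional[str]:
    """Return the first candidate action whose predicted next state meets the goal."""
    for action in candidates:
        if goal(model.predict(state, action)):
            return action
    return None


if __name__ == "__main__":
    model = ToyFileWorldModel()
    chosen = plan_one_step(
        model,
        state="a.txt",
        candidates=["rm a.txt", "touch b.txt"],
        goal=lambda s: "b.txt" in s.split(),
    )
    print(chosen)
```

In an LLM-as-WM variant, `predict` would instead prompt a language model with the state and action and return its generated text; the planning helper would not need to change, since it depends only on the `predict` interface.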