Bridging the Gap Between Agents and the World: Text World Models for LLM-Based Agents

Abstract

Large language model (LLM)-based agents are increasingly used in interactive textual environments, from web navigation and code editing to tool use and long-horizon dialogue. Yet many remain largely reactive, mapping observations to actions without an explicit model of how these environments are structured and evolve. This motivates text world models (TWMs): transition models over textual states that, given a state and a candidate action, predict the resulting webpage, terminal output, API response, or user reply, thereby supporting planning, efficient learning, and principled evaluation. We systematically review text world models for LLM-based agents, organized around a formal framework and the agent lifecycle: (1) Foundations, defining text world models and characterizing them by state representation and grounding domain; (2) Construction, taxonomizing LLM-as-WM and code-as-WM paradigms and reviewing methods for building them; (3) Application, examining how world models support agents at training time through experience synthesis and at inference time through planning, verification, and adaptation; and (4) Evaluation, covering both evaluation of the world model itself and its use as an evaluation environment for agents. We aim to consolidate this rapidly developing area, clarify its design space, and highlight open challenges for future research.
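To make the transition-model view concrete, the following is a minimal Python sketch of a text world model used for one-step lookahead planning. It is an illustration only, not a method from the survey: the `TextWorldModel` interface, the toy `ToyCounterWorldModel` (a hand-written stand-in for the code-as-WM idea), the `plan_one_step` helper, and the `count=N` state format are all hypothetical names and conventions assumed here.

```python
from typing import Callable, Iterable, Protocol


class TextWorldModel(Protocol):
    """Hypothetical interface: map (textual state, action) to the next textual state."""

    def predict(self, state: str, action: str) -> str: ...


class ToyCounterWorldModel:
    """Toy rule-based world model; states are strings such as 'count=3' (assumed format)."""

    def predict(self, state: str, action: str) -> str:
        # Parse the integer after the first '=' in the textual state.
        count = int(state.split("=", 1)[1])
        if action == "increment":
            count += 1
        elif action == "reset":
            count = 0
        # Any other action leaves the state unchanged.
        return f"count={count}"


def plan_one_step(
    wm: TextWorldModel,
    state: str,
    actions: Iterable[str],
    score: Callable[[str], float],
) -> str:
    """Return the action whose predicted next state has the highest score."""
    return max(actions, key=lambda a: score(wm.predict(state, a)))


def count_score(state: str) -> float:
    """Assumed task-specific objective: prefer larger counters."""
    return float(state.split("=", 1)[1])


wm = ToyCounterWorldModel()
next_state = wm.predict("count=3", "increment")
best_action = plan_one_step(wm, "count=3", ["increment", "reset", "noop"], count_score)
```

In an LLM-as-WM setting, `predict` would instead prompt a language model for the next state; the planning loop around it could stay the same under this assumed interface.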