R-WoM: Retrieval-augmented World Model For Computer-use Agents

October 13, 2025
Authors: Kai Mei, Jiang Guo, Shuaichen Chang, Mingwen Dong, Dongkyu Lee, Xing Niu, Jiarong Jiang
cs.AI

Abstract

Large Language Models (LLMs) can serve as world models to enhance agent decision-making in digital environments by simulating future states and predicting action outcomes, potentially eliminating costly trial-and-error exploration. However, this capability is fundamentally limited by LLMs' tendency toward hallucination and their reliance on static training knowledge, which can lead to compounding errors that inhibit long-horizon simulations. To systematically investigate whether LLMs are appropriate for world modeling, we probe two core capabilities of world models--future state prediction and reward estimation--through three tasks: next-state identification, full-procedure planning alignment, and milestone transition recognition. Our analysis shows that while LLMs effectively capture immediate next states and identify meaningful state transitions, their performance rapidly degrades in full-procedure planning. This highlights LLMs' limitations in reliably modeling environment dynamics over long horizons. To address these limitations, we propose the Retrieval-augmented World Model (R-WoM), which grounds LLM simulations by incorporating factual, up-to-date knowledge retrieved from external tutorials. Experiments show that R-WoM achieves substantial improvements of up to 25.3% (OSWorld) and 18.1% (WebArena) compared to baselines, with particular advantages in longer-horizon simulations.
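The abstract describes the idea of grounding an LLM's state simulation in retrieved tutorial text but gives no implementation details. The following is a minimal sketch of that general pattern only, not the authors' method: the word-overlap retriever, the prompt wording, the function names (`retrieve`, `build_prompt`, `predict_next_state`), and the `llm` callable are all hypothetical stand-ins chosen for illustration.

```python
from typing import Callable, List, Set


def tokenize(text: str) -> Set[str]:
    """Lowercase and split on whitespace; a deliberately crude tokenizer."""
    return set(text.lower().split())


def retrieve(query: str, tutorials: List[str], k: int = 1) -> List[str]:
    """Return the k tutorials sharing the most words with the query.

    Assumption: plain word overlap stands in for whatever retriever
    a real system would use.
    """
    query_tokens = tokenize(query)
    ranked = sorted(
        tutorials,
        key=lambda doc: len(query_tokens & tokenize(doc)),
        reverse=True,
    )
    return ranked[:k]


def build_prompt(state: str, action: str, evidence: List[str]) -> str:
    """Assemble a prompt that places retrieved tutorial text before the query."""
    evidence_block = "\n".join(f"- {doc}" for doc in evidence)
    return (
        "Tutorial excerpts:\n"
        f"{evidence_block}\n"
        f"Current state: {state}\n"
        f"Proposed action: {action}\n"
        "Describe the next state."
    )


def predict_next_state(
    state: str,
    action: str,
    tutorials: List[str],
    llm: Callable[[str], str],
    k: int = 1,
) -> str:
    """Ask the (hypothetical) llm callable for the next state, grounded in retrieval."""
    evidence = retrieve(f"{state} {action}", tutorials, k)
    return llm(build_prompt(state, action, evidence))
```

Passing the model as a plain callable keeps the sketch independent of any particular LLM client; in practice `llm` would wrap a real model call, and the retriever would likely be embedding-based rather than word overlap.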
October 15, 2025