R-WoM: Retrieval-augmented World Model For Computer-use Agents

Abstract

Large Language Models (LLMs) can serve as world models that enhance agent decision-making in digital environments by simulating future states and predicting action outcomes, potentially eliminating costly trial-and-error exploration. However, this capability is fundamentally limited by LLMs' tendency to hallucinate and their reliance on static training knowledge, which can lead to compounding errors that inhibit long-horizon simulation. To systematically investigate whether LLMs are appropriate for world modeling, we probe two core capabilities of world models, future state prediction and reward estimation, through three tasks: next-state identification, full-procedure planning alignment, and milestone transition recognition. Our analysis shows that while LLMs effectively capture immediate next states and identify meaningful state transitions, their performance degrades rapidly in full-procedure planning. This highlights LLMs' limitations in reliably modeling environment dynamics over long horizons. To address these limitations, we propose the Retrieval-augmented World Model (R-WoM), which grounds LLM simulations by incorporating factual, up-to-date knowledge retrieved from external tutorials. Experiments show that R-WoM achieves substantial improvements of up to 25.3% on OSWorld and 18.1% on WebArena compared to baselines, with particular advantages in longer-horizon simulations.
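The idea of grounding a simulated rollout in retrieved tutorial text can be illustrated with a minimal sketch. This is not the paper's implementation: the word-overlap retriever, the prompt layout, and the names `retrieve`, `grounded_rollout`, and `world_model` are all assumptions made for illustration, and `world_model` stands in for any callable that maps a prompt string to a predicted next state.

```python
import re
from typing import Callable, List, Sequence


def tokens(text: str) -> set:
    """Lowercase alphanumeric word set used by the toy retriever."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def retrieve(query: str, tutorials: Sequence[str], k: int = 2) -> List[str]:
    """Rank tutorial snippets by word overlap with the query.

    Hypothetical stand-in for a real retriever (an assumption, not the
    paper's retrieval method).
    """
    q = tokens(query)
    ranked = sorted(tutorials, key=lambda t: len(q & tokens(t)), reverse=True)
    return list(ranked[:k])


def grounded_rollout(
    task: str,
    state: str,
    actions: Sequence[str],
    tutorials: Sequence[str],
    world_model: Callable[[str], str],
    k: int = 2,
) -> List[str]:
    """Simulate the actions one by one, placing retrieved excerpts in each prompt."""
    evidence = retrieve(task, tutorials, k)
    trajectory = [state]
    for action in actions:
        # Assumed prompt layout: evidence first, then the current state and action.
        prompt = (
            "Tutorial excerpts:\n" + "\n".join(evidence)
            + f"\nCurrent state: {state}\nAction: {action}\nNext state:"
        )
        state = world_model(prompt)  # predicted next state (an LLM call in practice)
        trajectory.append(state)
    return trajectory
```

In this sketch the retrieved excerpts are fetched once per task and reused at every simulated step, so each prediction is conditioned on the same external evidence rather than on the model's static knowledge alone. A real system could instead re-retrieve at each step.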
