Computer-Using World Model

Abstract

Agents operating in complex software environments benefit from reasoning about the consequences of their actions, because even a single incorrect user interface (UI) operation can derail a long, artifact-preserving workflow. The challenge is particularly acute in computer-using scenarios: real execution does not support counterfactual exploration, so large-scale trial-and-error learning and planning are impractical even though the environment is fully digital and deterministic. We introduce the Computer-Using World Model (CUWM), a world model for desktop software that predicts the next UI state given the current state and a candidate action. CUWM adopts a two-stage factorization of UI dynamics: it first predicts a textual description of the agent-relevant state changes, and then realizes these changes visually to synthesize the next screenshot. CUWM is trained on offline UI transitions collected from agents interacting with real Microsoft Office applications, and is further refined with a lightweight reinforcement learning stage that aligns its textual transition predictions with the structural requirements of computer-using environments. We evaluate CUWM via test-time action search, in which a frozen agent uses the world model to simulate and compare candidate actions before executing one. Across a range of Office tasks, world-model-guided test-time scaling improves decision quality and execution robustness.
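
The two-stage factorization and the test-time action search described above can be illustrated with a minimal sketch. Everything in it is a hypothetical stand-in rather than CUWM's actual interface: the names `UIState`, `TwoStageWorldModel`, and `search_action` are invented for illustration, a short string replaces the screenshot, and the scoring function is an assumption, since the abstract says only that the frozen agent compares the simulated outcomes.

```python
from dataclasses import dataclass
from typing import Callable, Sequence


@dataclass(frozen=True)
class UIState:
    """Hypothetical stand-in for a UI state (a string instead of a screenshot)."""
    description: str


@dataclass(frozen=True)
class Prediction:
    change_text: str      # stage 1: textual description of the state change
    next_state: UIState   # stage 2: that change realized as the next state


class TwoStageWorldModel:
    """Composes a text-transition predictor with a renderer (both assumed)."""

    def __init__(
        self,
        predict_change: Callable[[UIState, str], str],
        render_next: Callable[[UIState, str], UIState],
    ) -> None:
        self.predict_change = predict_change
        self.render_next = render_next

    def step(self, state: UIState, action: str) -> Prediction:
        # Stage 1: describe what the action would change.
        change = self.predict_change(state, action)
        # Stage 2: apply the described change to obtain the next state.
        return Prediction(change, self.render_next(state, change))


def search_action(
    model: TwoStageWorldModel,
    state: UIState,
    candidates: Sequence[str],
    score: Callable[[Prediction], float],
) -> str:
    """Simulate each candidate and return the one with the best predicted outcome."""
    if not candidates:
        raise ValueError("at least one candidate action is required")
    return max(candidates, key=lambda action: score(model.step(state, action)))


# Toy stubs standing in for the learned components (assumptions, not CUWM).
def toy_predict_change(state: UIState, action: str) -> str:
    if action == "click Bold":
        return "selected text becomes bold"
    return "no visible change"


def toy_render_next(state: UIState, change: str) -> UIState:
    return UIState(f"{state.description}; {change}")


def toy_score(prediction: Prediction) -> float:
    # Assumed task goal: the heading should end up bold.
    return 1.0 if "bold" in prediction.next_state.description else 0.0


model = TwoStageWorldModel(toy_predict_change, toy_render_next)
start = UIState("Word document with a selected heading")
best = search_action(model, start, ["click Italic", "click Bold"], toy_score)
```

No candidate is executed against the real application during the search; only the selected action would be sent to the environment, which is the reason for simulating first when real execution cannot be rolled back.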