Computer-Using World Model

Abstract

Agents operating in complex software environments benefit from reasoning about the consequences of their actions, because even a single incorrect user interface (UI) operation can derail a long workflow whose artifacts must be preserved. The challenge is particularly acute in computer-using scenarios: although the environment is fully digital and deterministic, real execution does not support counterfactual exploration, which makes large-scale trial-and-error learning and planning impractical. We introduce the Computer-Using World Model (CUWM), a world model for desktop software that predicts the next UI state given the current state and a candidate action. CUWM adopts a two-stage factorization of UI dynamics: it first predicts a textual description of the agent-relevant state changes, and then realizes these changes visually to synthesize the next screenshot. CUWM is trained on offline UI transitions collected from agents interacting with real Microsoft Office applications, and is further refined with a lightweight reinforcement learning stage that aligns the textual transition predictions with the structural requirements of computer-using environments. We evaluate CUWM via test-time action search, in which a frozen agent uses the world model to simulate and compare candidate actions before executing one. Across a range of Office tasks, world-model-guided test-time scaling improves decision quality and execution robustness.
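To make the two-stage factorization and the test-time action search concrete, the following is a minimal Python sketch rather than the authors' implementation. The names `Prediction`, `predict`, `select_action`, `describe_change`, `render_change`, and `score` are all hypothetical, and the abstract does not say how simulated outcomes are compared, so the scoring callable is an assumption supplied by the caller.

```python
from dataclasses import dataclass
from typing import Callable, Sequence, Tuple


@dataclass(frozen=True)
class Prediction:
    """Hypothetical container for the two-stage world-model output."""
    change_description: str  # stage 1: text describing the state changes
    next_screenshot: bytes   # stage 2: synthesized next screenshot


def predict(
    screenshot: bytes,
    action: str,
    describe_change: Callable[[bytes, str], str],
    render_change: Callable[[bytes, str], bytes],
) -> Prediction:
    """Run both stages: describe the change in text, then render it."""
    text = describe_change(screenshot, action)
    image = render_change(screenshot, text)
    return Prediction(change_description=text, next_screenshot=image)


def select_action(
    screenshot: bytes,
    candidates: Sequence[str],
    describe_change: Callable[[bytes, str], str],
    render_change: Callable[[bytes, str], bytes],
    score: Callable[[Prediction], float],
) -> Tuple[str, Prediction]:
    """Simulate every candidate action and return the best-scoring one.

    Nothing is executed in the real application; only the simulated
    outcomes are compared. `score` is an assumed stand-in for however
    the frozen agent judges a predicted outcome.
    """
    if not candidates:
        raise ValueError("at least one candidate action is required")
    simulated = [
        (action, predict(screenshot, action, describe_change, render_change))
        for action in candidates
    ]
    return max(simulated, key=lambda pair: score(pair[1]))
```

In this sketch the world model only ranks the candidates proposed by the agent; the agent itself stays frozen, and the selected action is the only one sent to the real application.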