Computer-Using World Model

Abstract

Agents operating in complex software environments benefit from reasoning about the consequences of their actions, as even a single incorrect user interface (UI) operation can derail long, artifact-preserving workflows. This challenge is particularly acute in computer-using scenarios: although the environment is fully digital and deterministic, real execution does not support counterfactual exploration, which makes large-scale trial-and-error learning and planning impractical. We introduce the Computer-Using World Model (CUWM), a world model for desktop software that predicts the next UI state given the current state and a candidate action. CUWM adopts a two-stage factorization of UI dynamics: it first predicts a textual description of agent-relevant state changes, and then realizes these changes visually to synthesize the next screenshot. CUWM is trained on offline UI transitions collected from agents interacting with real Microsoft Office applications, and is further refined with a lightweight reinforcement learning stage that aligns textual transition predictions with the structural requirements of computer-using environments. We evaluate CUWM via test-time action search, in which a frozen agent uses the world model to simulate and compare candidate actions before execution. Across a range of Office tasks, world-model-guided test-time scaling improves decision quality and execution robustness.
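The two-stage prediction and the test-time action search described above can be illustrated with a minimal sketch. Everything named here is hypothetical and chosen only for illustration: `Prediction`, `predict_next_state`, `select_action`, and the `toy_*` stubs are not taken from CUWM, plain strings stand in for screenshots, and the scoring function is an assumption, since the abstract does not specify how the frozen agent compares simulated outcomes.

```python
from dataclasses import dataclass
from typing import Callable, Sequence


@dataclass(frozen=True)
class Prediction:
    change_text: str      # stage 1: textual description of the state change
    next_screenshot: str  # stage 2: a string stands in for the synthesized image


def predict_next_state(
    state: str,
    action: str,
    describe_change: Callable[[str, str], str],
    render_change: Callable[[str, str], str],
) -> Prediction:
    """Two-stage prediction: describe the change in text, then render it."""
    change_text = describe_change(state, action)
    next_screenshot = render_change(state, change_text)
    return Prediction(change_text, next_screenshot)


def select_action(
    state: str,
    candidates: Sequence[str],
    describe_change: Callable[[str, str], str],
    render_change: Callable[[str, str], str],
    score: Callable[[Prediction], float],
) -> str:
    """Simulate every candidate and return the one with the highest score.

    Nothing is executed in the real environment; only predictions are compared.
    """
    if not candidates:
        raise ValueError("need at least one candidate action")
    predictions = {
        action: predict_next_state(state, action, describe_change, render_change)
        for action in candidates
    }
    return max(candidates, key=lambda action: score(predictions[action]))


# Toy stand-ins (assumptions, not the trained models) so the sketch runs.
def toy_describe(state: str, action: str) -> str:
    return f"{action} applied"


def toy_render(state: str, change_text: str) -> str:
    return f"{state} + [{change_text}]"


def toy_score(prediction: Prediction) -> float:
    # Assumed preference: the task goal is to save the document.
    return 1.0 if "save" in prediction.change_text else 0.0


chosen = select_action(
    "untitled.docx open",
    ["click_close", "click_save"],
    toy_describe,
    toy_render,
    toy_score,
)
```

In this sketch the agent and the world model are both held fixed; only the number of simulated candidates changes, which is one plausible reading of how test-time scaling could trade extra simulation for better decisions.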