Computer-Using World Model

Abstract

Agents operating in complex software environments benefit from reasoning about the consequences of their actions, as even a single incorrect user interface (UI) operation can derail long, artifact-preserving workflows. This challenge is particularly acute for computer-using scenarios, where real execution does not support counterfactual exploration, making large-scale trial-and-error learning and planning impractical despite the environment being fully digital and deterministic. We introduce the Computer-Using World Model (CUWM), a world model for desktop software that predicts the next UI state given the current state and a candidate action. CUWM adopts a two-stage factorization of UI dynamics: it first predicts a textual description of agent-relevant state changes, and then realizes these changes visually to synthesize the next screenshot. CUWM is trained on offline UI transitions collected from agents interacting with real Microsoft Office applications, and further refined with a lightweight reinforcement learning stage that aligns textual transition predictions with the structural requirements of computer-using environments. We evaluate CUWM via test-time action search, where a frozen agent uses the world model to simulate and compare candidate actions before execution. Across a range of Office tasks, world-model-guided test-time scaling improves decision quality and execution robustness.
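
The two-stage prediction and the test-time action search described above can be illustrated with a minimal sketch. This is not the authors' implementation: every name here (`UIState`, `PredictedTransition`, `TwoStageWorldModel`, `select_action`, and the toy functions) is hypothetical, the screenshot is represented as a plain string rather than an image, and the scoring function stands in for whatever judgment the frozen agent applies to simulated outcomes.

```python
from dataclasses import dataclass
from typing import Callable, Sequence


@dataclass(frozen=True)
class UIState:
    # Stand-in for a screenshot; a real system would hold image data.
    screenshot: str


@dataclass(frozen=True)
class PredictedTransition:
    change_description: str  # stage 1: textual description of the state change
    next_state: UIState      # stage 2: synthesized next screenshot


class TwoStageWorldModel:
    """Hypothetical wrapper: predict a textual change, then realize it visually."""

    def __init__(
        self,
        describe_change: Callable[[UIState, str], str],
        render_change: Callable[[UIState, str], UIState],
    ) -> None:
        self.describe_change = describe_change
        self.render_change = render_change

    def predict(self, state: UIState, action: str) -> PredictedTransition:
        description = self.describe_change(state, action)     # stage 1
        next_state = self.render_change(state, description)   # stage 2
        return PredictedTransition(description, next_state)


def select_action(
    state: UIState,
    candidates: Sequence[str],
    world_model: TwoStageWorldModel,
    score: Callable[[PredictedTransition], float],
) -> str:
    """Simulate every candidate with the world model and return the best-scoring one."""
    if not candidates:
        raise ValueError("at least one candidate action is required")
    # Nothing is executed in the real application; only predictions are compared.
    return max(candidates, key=lambda a: score(world_model.predict(state, a)))


# Toy stand-ins for the learned components (assumptions for illustration only).
def toy_describe(state: UIState, action: str) -> str:
    effects = {
        "click Bold": "selected text becomes bold",
        "click Delete": "selected text is removed",
    }
    return effects.get(action, "no visible change")


def toy_render(state: UIState, description: str) -> UIState:
    return UIState(screenshot=f"{state.screenshot} | {description}")


def toy_score(transition: PredictedTransition) -> float:
    # Pretend the task goal is "make the selected sentence bold".
    return 1.0 if "becomes bold" in transition.change_description else 0.0


model = TwoStageWorldModel(toy_describe, toy_render)
start = UIState(screenshot="Word document with a selected sentence")
chosen = select_action(
    start, ["click Delete", "click Bold", "press Escape"], model, toy_score
)
print(chosen)  # click Bold
```

In this sketch the destructive candidate ("click Delete") is rejected on the basis of its simulated outcome alone, which is the property that makes pre-execution search attractive when real actions cannot be undone or replayed.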