Step-level Optimization for Efficient Computer-use Agents

Abstract

Computer-use agents provide a promising path toward general software automation because they can interact directly with arbitrary graphical user interfaces instead of relying on brittle, application-specific integrations. Despite recent advances in benchmark performance, strong computer-use agents remain expensive and slow in practice, since most systems invoke large multimodal models at nearly every interaction step. We argue that this uniform allocation of compute is fundamentally inefficient for long-horizon GUI tasks. Such trajectories are highly heterogeneous: many steps are routine and can be handled reliably by smaller, cheaper policies, while errors tend to concentrate at a relatively small number of high-risk moments. Across computer-use benchmarks, these failures repeatedly take two forms: progress stalls, where the agent loops, repeats ineffective actions, or fails to make meaningful progress, and silent semantic drift, where the agent continues taking locally plausible actions after already deviating from the user's true goal. To address this inefficiency, we propose an event-driven, step-level cascade for computer-use agents that runs a small policy by default and escalates to a stronger model only when lightweight learned monitors detect elevated risk. Our framework combines two complementary signals: a Stuck Monitor that detects degraded progress from recent reasoning-action history and triggers recovery, and a Milestone Monitor that identifies semantically meaningful checkpoints where sparse verification is most informative for catching drift. This design turns always-on frontier-model inference into adaptive, on-demand compute allocation over the course of an evolving interaction. The framework is modular and deployment-oriented: it can be layered on top of existing computer-use agents without changing the underlying agent architecture or retraining the large model.
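The control flow described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: every name here (`StepCascade`, `toy_stuck_monitor`, `run_demo`, the thresholds, and the string-valued actions) is a hypothetical stand-in, and the learned monitors are replaced by simple callables that return a score in [0, 1]. The sketch only shows the routing idea: run the small policy by default, escalate on a stuck signal, and spend a sparse verification call at milestones.

```python
from dataclasses import dataclass
from typing import Callable, List

Action = str            # hypothetical: actions are plain strings in this sketch
History = List[Action]  # hypothetical: history is the list of past actions


@dataclass
class StepCascade:
    """Illustrative step-level router; all fields are assumed interfaces."""
    small_policy: Callable[[History], Action]
    large_policy: Callable[[History], Action]
    stuck_monitor: Callable[[History], float]      # stall risk score in [0, 1]
    milestone_monitor: Callable[[History], float]  # checkpoint score in [0, 1]
    verifier: Callable[[History], bool]            # strong-model check: still on goal?
    stuck_threshold: float = 0.5
    milestone_threshold: float = 0.5
    strong_calls: int = 0                          # counts strong-model invocations

    def step(self, history: History) -> Action:
        # Progress stall detected: escalate to the stronger model for recovery.
        if self.stuck_monitor(history) >= self.stuck_threshold:
            self.strong_calls += 1
            return self.large_policy(history)
        # Milestone detected: spend one sparse verification call to catch drift.
        if self.milestone_monitor(history) >= self.milestone_threshold:
            self.strong_calls += 1
            if not self.verifier(history):
                # Drift confirmed: let the stronger model choose the next action.
                self.strong_calls += 1
                return self.large_policy(history)
        # Routine step: the small, cheap policy handles it.
        return self.small_policy(history)


def toy_stuck_monitor(history: History) -> float:
    # Toy heuristic standing in for a learned monitor:
    # flag a stall when the last three actions are identical.
    if len(history) >= 3 and len(set(history[-3:])) == 1:
        return 1.0
    return 0.0


def run_demo():
    cascade = StepCascade(
        small_policy=lambda h: "click",
        large_policy=lambda h: "replan",
        stuck_monitor=toy_stuck_monitor,
        milestone_monitor=lambda h: 0.0,  # never fires in this demo
        verifier=lambda h: True,
    )
    history: History = []
    for _ in range(5):
        history.append(cascade.step(history))
    return history, cascade.strong_calls
```

In this toy run the small policy repeats the same action three times, the stand-in stuck monitor fires once, and the strong model is invoked for a single step out of five. A real system would replace the heuristics with trained monitors and the string actions with GUI operations.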
