

Step-level Optimization for Efficient Computer-use Agents

April 29, 2026
作者: Jinbiao Wei, Kangqi Ni, Yilun Zhao, Guo Gan, Arman Cohan
cs.AI

Abstract

Computer-use agents provide a promising path toward general software automation because they can interact directly with arbitrary graphical user interfaces instead of relying on brittle, application-specific integrations. Despite recent advances in benchmark performance, strong computer-use agents remain expensive and slow in practice, since most systems invoke large multimodal models at nearly every interaction step. We argue that this uniform allocation of compute is fundamentally inefficient for long-horizon GUI tasks. Such trajectories are highly heterogeneous: many steps are routine and can be handled reliably by smaller, cheaper policies, while errors tend to concentrate at a relatively small number of high-risk moments. Across computer-use benchmarks, these failures repeatedly take two forms: progress stalls, where the agent loops, repeats ineffective actions, or fails to make meaningful progress, and silent semantic drift, where the agent continues taking locally plausible actions after already deviating from the user's true goal. To address this inefficiency, we propose an event-driven, step-level cascade for computer-use agents that runs a small policy by default and escalates to a stronger model only when lightweight learned monitors detect elevated risk. Our framework combines two complementary signals: a Stuck Monitor that detects degraded progress from recent reasoning-action history and triggers recovery, and a Milestone Monitor that identifies semantically meaningful checkpoints where sparse verification is most informative for catching drift. This design turns always-on frontier-model inference into adaptive, on-demand compute allocation over the course of an evolving interaction. The framework is modular and deployment-oriented: it can be layered on top of existing computer-use agents without changing the underlying agent architecture or retraining the large model.
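The event-driven cascade the abstract describes can be sketched as a simple control loop: a small policy acts by default, and the agent escalates to a stronger model only when a Stuck Monitor flags degraded progress or a Milestone Monitor marks a checkpoint worth verifying. The sketch below is a minimal illustration under assumed interfaces; all class, method, and signal names (`CascadeAgent`, `small_policy`, `strong_policy`, the repetition-based stuck check, the `@milestone` tag) are hypothetical stand-ins, not the paper's actual API or learned monitors.

```python
from dataclasses import dataclass, field


@dataclass
class CascadeAgent:
    """Hypothetical sketch: small policy by default, escalate on elevated risk."""
    history: list = field(default_factory=list)

    def small_policy(self, obs: str) -> str:
        # Cheap default policy handling routine steps.
        return f"small:{obs}"

    def strong_policy(self, obs: str) -> str:
        # Expensive frontier model, invoked only on demand.
        return f"strong:{obs}"

    def stuck(self) -> bool:
        # Stand-in for the learned Stuck Monitor: here, flag a progress
        # stall when the last three actions are identical (a loop).
        recent = self.history[-3:]
        return len(recent) == 3 and len(set(recent)) == 1

    def milestone(self, obs: str) -> bool:
        # Stand-in for the learned Milestone Monitor: here, treat
        # observations tagged "@milestone" as checkpoints for sparse
        # verification against the user's goal.
        return obs.endswith("@milestone")

    def step(self, obs: str) -> str:
        if self.stuck() or self.milestone(obs):
            action = self.strong_policy(obs)  # escalate: elevated risk
        else:
            action = self.small_policy(obs)   # default: cheap path
        self.history.append(action)
        return action
```

In this toy version the monitors are hand-written heuristics; the paper's point is that lightweight *learned* monitors fill these roles, so the strong model runs only at the few high-risk moments rather than at every step.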