

Step-level Optimization for Efficient Computer-use Agents

April 29, 2026
Authors: Jinbiao Wei, Kangqi Ni, Yilun Zhao, Guo Gan, Arman Cohan
cs.AI

Abstract

Computer-use agents provide a promising path toward general software automation because they can interact directly with arbitrary graphical user interfaces instead of relying on brittle, application-specific integrations. Despite recent advances in benchmark performance, strong computer-use agents remain expensive and slow in practice, since most systems invoke large multimodal models at nearly every interaction step. We argue that this uniform allocation of compute is fundamentally inefficient for long-horizon GUI tasks. Such trajectories are highly heterogeneous: many steps are routine and can be handled reliably by smaller, cheaper policies, while errors tend to concentrate at a relatively small number of high-risk moments. Across computer-use benchmarks, these failures repeatedly take two forms: progress stalls, where the agent loops, repeats ineffective actions, or fails to make meaningful progress, and silent semantic drift, where the agent continues taking locally plausible actions after already deviating from the user's true goal. To address this inefficiency, we propose an event-driven, step-level cascade for computer-use agents that runs a small policy by default and escalates to a stronger model only when lightweight learned monitors detect elevated risk. Our framework combines two complementary signals: a Stuck Monitor that detects degraded progress from recent reasoning-action history and triggers recovery, and a Milestone Monitor that identifies semantically meaningful checkpoints where sparse verification is most informative for catching drift. This design turns always-on frontier-model inference into adaptive, on-demand compute allocation over the course of an evolving interaction. The framework is modular and deployment-oriented: it can be layered on top of existing computer-use agents without changing the underlying agent architecture or retraining the large model.
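The cascade described above can be sketched as a simple control loop. This is a minimal illustrative sketch, not the paper's implementation: the class names, the repetition-based stall heuristic, and the checkpoint set are all assumptions introduced here, standing in for the learned monitors the abstract describes.

```python
# Hypothetical sketch of an event-driven, step-level cascade: a cheap default
# policy acts at every step, and a stronger model is invoked only when
# lightweight monitors flag elevated risk. All names are illustrative.

from collections import deque


class StuckMonitor:
    """Toy stall detector: flags risk when an action repeats in a short window.
    (The paper's monitor is learned from reasoning-action history.)"""

    def __init__(self, window=4):
        self.history = deque(maxlen=window)

    def elevated_risk(self, action):
        stalled = self.history.count(action) >= 2  # repeated ineffective action
        self.history.append(action)
        return stalled


class MilestoneMonitor:
    """Toy checkpoint detector: a fixed set of semantically meaningful states
    where sparse verification is most informative."""

    def __init__(self, milestones):
        self.milestones = set(milestones)

    def is_checkpoint(self, state):
        return state in self.milestones


def run_episode(states, small_policy, strong_policy, stuck, milestone):
    """Run the cascade: small policy by default, escalate on monitor signals."""
    escalations = 0
    actions = []
    for state in states:
        action = small_policy(state)
        if stuck.elevated_risk(action) or milestone.is_checkpoint(state):
            action = strong_policy(state)  # on-demand frontier-model call
            escalations += 1
        actions.append(action)
    return actions, escalations
```

A small policy that loops (always emitting the same action) triggers the stall signal after a couple of steps, and a milestone state triggers verification regardless, so strong-model calls concentrate at exactly the high-risk moments rather than every step.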
PDF · May 2, 2026