The Unreasonable Effectiveness of Scaling Agents for Computer Use
October 2, 2025
Authors: Gonzalo Gonzalez-Pumariega, Vincent Tu, Chih-Lun Lee, Jiachen Yang, Ang Li, Xin Eric Wang
cs.AI
Abstract
Computer-use agents (CUAs) hold promise for automating everyday digital
tasks, but their unreliability and high variance hinder their application to
long-horizon, complex tasks. We introduce Behavior Best-of-N (bBoN), a method
that scales over agents by generating multiple rollouts and selecting among
them using behavior narratives that describe the agents' rollouts. It enables
both wide exploration and principled trajectory selection, substantially
improving robustness and success rates. On OSWorld, our bBoN scaling method
establishes a new state of the art (SoTA) at 69.9%, significantly outperforming
prior methods and approaching the 72% human-level performance, with
comprehensive ablations validating key design choices. We further demonstrate
strong generalization to different operating systems on
WindowsAgentArena and AndroidWorld. Crucially, our results highlight the
unreasonable effectiveness of scaling CUAs, when you do it right: effective
scaling requires structured trajectory understanding and selection, and bBoN
provides a practical framework to achieve this.
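
The selection loop the abstract describes (generate several rollouts, summarize each as a behavior narrative, then choose among the narratives) can be sketched roughly as follows. This is a minimal illustration and not the authors' implementation: `Rollout`, `narrate`, `behavior_best_of_n`, `toy_agent`, and `toy_judge` are hypothetical names, and the toy agent and judge are simple stand-ins for a real computer-use agent and a model-based judge.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence


@dataclass
class Rollout:
    """One attempt at a task (hypothetical structure, not from the paper)."""
    actions: List[str]


def narrate(rollout: Rollout) -> str:
    """Turn a rollout into a short text narrative of what the agent did.

    Assumption: here the narrative is just the joined action list; a real
    system would presumably describe the observed effects of each action.
    """
    return " -> ".join(rollout.actions)


def behavior_best_of_n(
    task: str,
    run_agent: Callable[[str, int], Rollout],
    judge: Callable[[str, Sequence[str]], int],
    n: int,
) -> Rollout:
    """Generate n rollouts and return the one whose narrative the judge picks."""
    if n < 1:
        raise ValueError("n must be at least 1")
    rollouts = [run_agent(task, seed) for seed in range(n)]
    narratives = [narrate(r) for r in rollouts]
    best_index = judge(task, narratives)  # index into narratives
    return rollouts[best_index]


def toy_agent(task: str, seed: int) -> Rollout:
    """Stand-in agent: returns a fixed plan chosen by the seed."""
    plans = [
        ["open editor", "type text"],
        ["open editor", "type text", "save file"],
        ["open browser"],
    ]
    return Rollout(actions=plans[seed % len(plans)])


def toy_judge(task: str, narratives: Sequence[str]) -> int:
    """Stand-in judge: prefers the first narrative that mentions saving."""
    scores = [int("save file" in text) for text in narratives]
    return scores.index(max(scores))


if __name__ == "__main__":
    best = behavior_best_of_n("write and save a note", toy_agent, toy_judge, n=3)
    print(best.actions)
```

In this sketch the judge sees only the narratives, never the raw rollouts, which mirrors the abstract's point that selection operates on a structured description of each trajectory.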