The Unreasonable Effectiveness of Scaling Agents for Computer Use

Abstract

Computer-use agents (CUAs) hold promise for automating everyday digital tasks, but their unreliability and high variance hinder their application to long-horizon, complex tasks. We introduce Behavior Best-of-N (bBoN), a method that scales over agents by generating multiple rollouts and selecting among them using behavior narratives that describe each agent's rollout. This enables both wide exploration and principled trajectory selection, substantially improving robustness and success rates. On OSWorld, our bBoN scaling method establishes a new state of the art (SoTA) at 69.9%, significantly outperforming prior methods and approaching human-level performance at 72%, with comprehensive ablations validating key design choices. We further demonstrate strong generalization to other operating systems on WindowsAgentArena and AndroidWorld. Crucially, our results highlight the unreasonable effectiveness of scaling CUAs when it is done right: effective scaling requires structured trajectory understanding and selection, and bBoN provides a practical framework to achieve this.
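The generate-then-select loop described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the names `Rollout`, `behavior_best_of_n`, `run_agent`, `narrate`, and `judge` are hypothetical, and the toy agent, narrator, and judge below are simple stand-ins for what would presumably be model-driven components in a real system.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Rollout:
    """One agent attempt at a task (hypothetical structure)."""
    actions: List[str]
    narrative: str = ""


def behavior_best_of_n(
    task: str,
    run_agent: Callable[[str, int], Rollout],
    narrate: Callable[[Rollout], str],
    judge: Callable[[str, List[str]], int],
    n: int = 4,
) -> Rollout:
    """Generate n rollouts, summarize each as a narrative, pick one."""
    # Wide exploration: run the agent n times with different seeds.
    rollouts = [run_agent(task, seed) for seed in range(n)]
    # Describe what each rollout did in a compact narrative.
    for rollout in rollouts:
        rollout.narrative = narrate(rollout)
    # Selection: the judge compares narratives and returns an index.
    best_index = judge(task, [r.narrative for r in rollouts])
    return rollouts[best_index]


# --- Toy stand-ins (assumptions, for illustration only) ---

def toy_agent(task: str, seed: int) -> Rollout:
    # Only the rollout with seed 2 completes the final step.
    actions = ["open editor", "type text"]
    if seed == 2:
        actions.append("save file")
    return Rollout(actions)


def toy_narrate(rollout: Rollout) -> str:
    # A real narrator would summarize observed behavior; here we join actions.
    return "; ".join(rollout.actions)


def toy_judge(task: str, narratives: List[str]) -> int:
    # Prefer the first narrative that mentions saving the file.
    scores = ["save file" in text for text in narratives]
    return scores.index(max(scores))


best = behavior_best_of_n(
    "write and save a note", toy_agent, toy_narrate, toy_judge, n=4
)
print(best.narrative)
```

In this sketch the judge sees only the narratives, not the raw action traces, which mirrors the abstract's point that selection operates on descriptions of behavior rather than on unstructured rollouts.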
