The Unreasonable Effectiveness of Scaling Agents for Computer Use

Abstract

Computer-use agents (CUAs) hold promise for automating everyday digital tasks, but their unreliability and high variance hinder their application to long-horizon, complex tasks. We introduce Behavior Best-of-N (bBoN), a method that scales over agents by generating multiple rollouts and selecting among them using behavior narratives that describe the agents' rollouts. It enables both wide exploration and principled trajectory selection, substantially improving robustness and success rates. On OSWorld, our bBoN scaling method establishes a new state of the art (SoTA) at 69.9%, significantly outperforming prior methods and approaching human-level performance at 72%, with comprehensive ablations validating key design choices. We further demonstrate strong generalization results to different operating systems on WindowsAgentArena and AndroidWorld. Crucially, our results highlight the unreasonable effectiveness of scaling CUAs, when you do it right: effective scaling requires structured trajectory understanding and selection, and bBoN provides a practical framework to achieve this.
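
The generate-then-select loop described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the names `Rollout`, `behavior_best_of_n`, `run_agent`, `narrate`, and `judge` are hypothetical, and the toy agent, narrator, and judge below are stand-ins for what would presumably be model-backed components. The sketch only shows the overall shape: produce several rollouts, turn each into a behavior narrative, and let a selector choose among the narratives.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence


@dataclass
class Rollout:
    """One full attempt at a task (hypothetical container, not from the paper)."""
    seed: int
    actions: List[str]


def behavior_best_of_n(
    task: str,
    run_agent: Callable[[str, int], Rollout],
    narrate: Callable[[Rollout], str],
    judge: Callable[[str, Sequence[str]], int],
    n: int = 4,
) -> Rollout:
    """Generate n rollouts, summarize each as a narrative, return the judged best."""
    if n < 1:
        raise ValueError("n must be at least 1")
    # Wide exploration: n independent attempts at the same task.
    rollouts = [run_agent(task, seed) for seed in range(n)]
    # Each rollout is described by a short behavior narrative.
    narratives = [narrate(r) for r in rollouts]
    # Selection happens over the narratives, not the raw trajectories.
    best_index = judge(task, narratives)
    return rollouts[best_index]


# Toy stand-ins (assumptions) so the sketch runs without a real agent or model.
def toy_agent(task: str, seed: int) -> Rollout:
    actions = ["open editor", "type text"]
    if seed % 3 == 2:  # only some attempts finish the task
        actions.append("save file")
    return Rollout(seed=seed, actions=actions)


def toy_narrate(rollout: Rollout) -> str:
    return "The agent did: " + ", then ".join(rollout.actions) + "."


def toy_judge(task: str, narratives: Sequence[str]) -> int:
    # Pick the first narrative that mentions saving; fall back to index 0.
    for i, text in enumerate(narratives):
        if "save file" in text:
            return i
    return 0


chosen = behavior_best_of_n(
    "write and save a note", toy_agent, toy_narrate, toy_judge, n=4
)
print(chosen.seed, chosen.actions)
```

In this toy setup only one of the four attempts completes the task, and the selector finds it from the narratives alone. A single rollout, by contrast, would succeed or fail depending on which attempt happened to be drawn.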
