Surfer 2: The Next Generation of Cross-Platform Computer Use Agents

Abstract

Building agents that generalize across web, desktop, and mobile environments remains an open challenge, because prior systems rely on environment-specific interfaces that limit cross-platform deployment. We introduce Surfer 2, a unified architecture that operates purely from visual observations and achieves state-of-the-art performance across all three environments. Surfer 2 integrates hierarchical context management, decoupled planning and execution, and self-verification with adaptive recovery, enabling reliable operation over long task horizons. Our system achieves 97.1% accuracy on WebVoyager, 69.6% on WebArena, 60.1% on OSWorld, and 87.1% on AndroidWorld, outperforming all prior systems without task-specific fine-tuning. When multiple attempts are allowed, Surfer 2 exceeds human performance on all benchmarks. These results suggest that systematic orchestration amplifies foundation model capabilities and enables general-purpose computer control through visual interaction alone, while achieving Pareto-optimal cost-efficiency will require next-generation vision-language models.