Surfer 2: The Next Generation of Cross-Platform Computer Use Agents

October 22, 2025
Authors: Mathieu Andreux, Märt Bakler, Yanael Barbier, Hamza Benchekroun, Emilien Biré, Antoine Bonnet, Riaz Bordie, Nathan Bout, Matthias Brunel, Aleix Cambray, Pierre-Louis Cedoz, Antoine Chassang, Gautier Cloix, Ethan Connelly, Alexandra Constantinou, Ramzi De Coster, Hubert de la Jonquiere, Aurélien Delfosse, Maxime Delpit, Alexis Deprez, Augustin Derupti, Mathieu Diaz, Shannon D'Souza, Julie Dujardin, Abai Edmund, Michael Eickenberg, Armand Fatalot, Wissem Felissi, Isaac Herring, Xavier Koegler, Erwan Le Jumeau de Kergaradec, Aurélien Lac, Maxime Langevin, Corentin Lauverjat, Antonio Loison, Avshalom Manevich, Axel Moyal, Axel Nguyen Kerbel, Marinela Parovic, Julien Revelle, Guillaume Richard, Mats Richter, Ronan Riochet, María Santos, Romain Savidan, Laurent Sifre, Maxime Theillard, Marc Thibault, Ivan Valentini, Tony Wu, Laura Yie, Kai Yuan, Jevgenij Zubovskij
cs.AI

Abstract

Building agents that generalize across web, desktop, and mobile environments remains an open challenge, as prior systems rely on environment-specific interfaces that limit cross-platform deployment. We introduce Surfer 2, a unified architecture operating purely from visual observations that achieves state-of-the-art performance across all three environments. Surfer 2 integrates hierarchical context management, decoupled planning and execution, and self-verification with adaptive recovery, enabling reliable operation over long task horizons. Our system achieves 97.1% accuracy on WebVoyager, 69.6% on WebArena, 60.1% on OSWorld, and 87.1% on AndroidWorld, outperforming all prior systems without task-specific fine-tuning. With multiple attempts, Surfer 2 exceeds human performance on all benchmarks. These results demonstrate that systematic orchestration amplifies foundation model capabilities and enables general-purpose computer control through visual interaction alone, while calling for a next-generation vision language model to achieve Pareto-optimal cost-efficiency.
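
The abstract names decoupled planning and execution plus self-verification with adaptive recovery, but gives no implementation detail. The following is only a minimal sketch of that general pattern, not Surfer 2's actual design: every name (`AgentState`, `run_task`, `max_replans`, the toy callbacks) is hypothetical, and plain strings stand in for the screenshots a visual agent would observe.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Tuple


@dataclass
class AgentState:
    """Hypothetical container: the goal plus compact summaries of past steps."""
    goal: str
    history: List[str] = field(default_factory=list)


def run_task(
    goal: str,
    plan: Callable[[AgentState], List[str]],
    execute: Callable[[str], str],
    verify: Callable[[AgentState], bool],
    max_replans: int = 2,
) -> Tuple[bool, AgentState]:
    """Plan, execute, verify; on a failed check, record it and replan."""
    state = AgentState(goal)
    for attempt in range(max_replans + 1):
        # The planner only proposes subgoals; the executor only acts on them.
        for subgoal in plan(state):
            observation = execute(subgoal)
            state.history.append(f"{subgoal} -> {observation}")
        if verify(state):
            return True, state
        # Recovery: keep the failure in context so the next plan can adapt.
        state.history.append(f"verification failed (attempt {attempt + 1})")
    return False, state


# --- Toy environment (illustrative assumption, not a real benchmark) ---
def toy_plan(state: AgentState) -> List[str]:
    # After a recorded failure, add a recovery step before retrying.
    if any("verification failed" in h for h in state.history):
        return ["dismiss popup", "click submit"]
    return ["click submit"]


def make_toy_execute() -> Callable[[str], str]:
    popup_open = [True]  # mutable flag captured by the closure

    def execute(subgoal: str) -> str:
        if subgoal == "dismiss popup":
            popup_open[0] = False
            return "popup closed"
        if subgoal == "click submit":
            return "blocked by popup" if popup_open[0] else "form submitted"
        return "unknown"

    return execute


def toy_verify(state: AgentState) -> bool:
    return bool(state.history) and state.history[-1].endswith("form submitted")


ok, final_state = run_task(
    "submit the form", toy_plan, make_toy_execute(), toy_verify
)
```

In this toy run the first attempt is blocked, the verifier rejects it, and the second plan inserts a recovery step before retrying. Keeping the planner, executor, and verifier as separate callables is one way to express the decoupling the abstract describes; the real system's interfaces may differ.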