Surfer 2: The Next Generation of Cross-Platform Computer Use Agents
October 22, 2025
Authors: Mathieu Andreux, Märt Bakler, Yanael Barbier, Hamza Benchekroun, Emilien Biré, Antoine Bonnet, Riaz Bordie, Nathan Bout, Matthias Brunel, Aleix Cambray, Pierre-Louis Cedoz, Antoine Chassang, Gautier Cloix, Ethan Connelly, Alexandra Constantinou, Ramzi De Coster, Hubert de la Jonquiere, Aurélien Delfosse, Maxime Delpit, Alexis Deprez, Augustin Derupti, Mathieu Diaz, Shannon D'Souza, Julie Dujardin, Abai Edmund, Michael Eickenberg, Armand Fatalot, Wissem Felissi, Isaac Herring, Xavier Koegler, Erwan Le Jumeau de Kergaradec, Aurélien Lac, Maxime Langevin, Corentin Lauverjat, Antonio Loison, Avshalom Manevich, Axel Moyal, Axel Nguyen Kerbel, Marinela Parovic, Julien Revelle, Guillaume Richard, Mats Richter, Ronan Riochet, María Santos, Romain Savidan, Laurent Sifre, Maxime Theillard, Marc Thibault, Ivan Valentini, Tony Wu, Laura Yie, Kai Yuan, Jevgenij Zubovskij
cs.AI
Abstract
Building agents that generalize across web, desktop, and mobile environments
remains an open challenge, as prior systems rely on environment-specific
interfaces that limit cross-platform deployment. We introduce Surfer 2, a
unified architecture operating purely from visual observations that achieves
state-of-the-art performance across all three environments. Surfer 2 integrates
hierarchical context management, decoupled planning and execution, and
self-verification with adaptive recovery, enabling reliable operation over long
task horizons. Our system achieves 97.1% accuracy on WebVoyager, 69.6% on
WebArena, 60.1% on OSWorld, and 87.1% on AndroidWorld, outperforming all prior
systems without task-specific fine-tuning. With multiple attempts, Surfer 2
exceeds human performance on all benchmarks. These results demonstrate that
systematic orchestration amplifies foundation model capabilities and enables
general-purpose computer control through visual interaction alone, while
calling for a next-generation vision-language model to achieve Pareto-optimal
cost-efficiency.