OS-Symphony: A Holistic Framework for Robust and Generalist Computer-Using Agent

January 12, 2026
Authors: Bowen Yang, Kaiming Jin, Zhenyu Wu, Zhaoyang Liu, Qiushi Sun, Zehao Li, JingJing Xie, Zhoumianze Liu, Fangzhi Xu, Kanzhi Cheng, Qingyun Li, Yian Wang, Yu Qiao, Zun Wang, Zichen Ding
cs.AI

Abstract

While Vision-Language Models (VLMs) have significantly advanced Computer-Using Agents (CUAs), current frameworks struggle with robustness in long-horizon workflows and generalization in novel domains. These limitations stem from a lack of granular control over historical visual context curation and the absence of visual-aware tutorial retrieval. To bridge these gaps, we introduce OS-Symphony, a holistic framework that comprises an Orchestrator coordinating two key innovations for robust automation: (1) a Reflection-Memory Agent that utilizes milestone-driven long-term memory to enable trajectory-level self-correction, effectively mitigating visual context loss in long-horizon tasks; (2) Versatile Tool Agents featuring a Multimodal Searcher that adopts a SeeAct paradigm to navigate a browser-based sandbox to synthesize live, visually aligned tutorials, thereby resolving fidelity issues in unseen scenarios. Experimental results demonstrate that OS-Symphony delivers substantial performance gains across varying model scales, establishing new state-of-the-art results on three online benchmarks, notably achieving 65.84% on OSWorld.
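
The abstract does not spell out how milestone-driven long-term memory is implemented. As a rough illustration only, the sketch below shows one way such a memory could curate historical visual context: it keeps screenshots from steps flagged as milestones plus a short window of the most recent steps, and drops the rest. The names `Step`, `MilestoneMemory`, `recent_window`, and `visual_context` are hypothetical and are not taken from the paper.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Step:
    """One agent step. All fields are illustrative assumptions."""
    index: int
    action: str
    screenshot: str            # stand-in for image data, e.g. a file name
    is_milestone: bool = False


@dataclass
class MilestoneMemory:
    """Hypothetical memory that retains milestone steps and recent steps."""
    recent_window: int = 2
    steps: List[Step] = field(default_factory=list)

    def add(self, step: Step) -> None:
        self.steps.append(step)

    def visual_context(self) -> List[Step]:
        # Keep a step if it is a milestone or falls inside the recent window.
        cutoff = len(self.steps) - self.recent_window
        return [
            s for i, s in enumerate(self.steps)
            if s.is_milestone or i >= cutoff
        ]
```

Under these assumptions, a six-step trajectory with milestones at steps 1 and 3 and `recent_window=2` would pass only steps 1, 3, 4, and 5 to the model, which bounds the number of screenshots while preserving key earlier states.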