OS-Symphony: A Holistic Framework for Robust and Generalist Computer-Using Agent

January 12, 2026
Authors: Bowen Yang, Kaiming Jin, Zhenyu Wu, Zhaoyang Liu, Qiushi Sun, Zehao Li, JingJing Xie, Zhoumianze Liu, Fangzhi Xu, Kanzhi Cheng, Qingyun Li, Yian Wang, Yu Qiao, Zun Wang, Zichen Ding
cs.AI

Abstract

While Vision-Language Models (VLMs) have significantly advanced Computer-Using Agents (CUAs), current frameworks struggle with robustness in long-horizon workflows and generalization in novel domains. These limitations stem from a lack of granular control over historical visual context curation and the absence of visual-aware tutorial retrieval. To bridge these gaps, we introduce OS-Symphony, a holistic framework that comprises an Orchestrator coordinating two key innovations for robust automation: (1) a Reflection-Memory Agent that utilizes milestone-driven long-term memory to enable trajectory-level self-correction, effectively mitigating visual context loss in long-horizon tasks; (2) Versatile Tool Agents featuring a Multimodal Searcher that adopts a SeeAct paradigm to navigate a browser-based sandbox to synthesize live, visually aligned tutorials, thereby resolving fidelity issues in unseen scenarios. Experimental results demonstrate that OS-Symphony delivers substantial performance gains across varying model scales, establishing new state-of-the-art results on three online benchmarks, notably achieving 65.84% on OSWorld.