OS Agents: A Survey on MLLM-based Agents for General Computing Devices Use

Abstract

The dream of creating AI assistants as capable and versatile as the fictional J.A.R.V.I.S. from Iron Man has long captivated imaginations. With the evolution of (multimodal) large language models ((M)LLMs), this dream is moving closer to reality: (M)LLM-based agents that automate tasks on computing devices (e.g., computers and mobile phones) by operating within the environments and interfaces provided by operating systems (OS), such as the graphical user interface (GUI), have advanced significantly. This paper presents a comprehensive survey of these advanced agents, designated as OS Agents. We begin by elucidating the fundamentals of OS Agents, exploring their key components, including the environment, observation space, and action space, and outlining essential capabilities such as understanding, planning, and grounding. We then examine methodologies for constructing OS Agents, focusing on domain-specific foundation models and agent frameworks. A detailed review of evaluation protocols and benchmarks shows how OS Agents are assessed across diverse tasks. Finally, we discuss current challenges and identify promising directions for future research, including safety and privacy, and personalization and self-evolution. This survey aims to consolidate the state of OS Agents research, providing insights to guide both academic inquiry and industrial development. We maintain an open-source GitHub repository as a dynamic resource to foster further innovation in this field. We also present a 9-page version of our work, accepted by ACL 2025, to provide a concise overview of the domain.

