OS Agents: A Survey on MLLM-based Agents for General Computing Devices Use

Abstract

The dream of creating AI assistants as capable and versatile as the fictional J.A.R.V.I.S from Iron Man has long captivated imaginations. With the evolution of (multi-modal) large language models ((M)LLMs), this dream is closer to reality: (M)LLM-based agents that use computing devices (e.g., computers and mobile phones), operating within the environments and interfaces (e.g., the Graphical User Interface (GUI)) provided by operating systems (OS) to automate tasks, have advanced significantly. This paper presents a comprehensive survey of these advanced agents, designated as OS Agents. We begin by elucidating the fundamentals of OS Agents, exploring their key components, including the environment, observation space, and action space, and outlining essential capabilities such as understanding, planning, and grounding. We then examine methodologies for constructing OS Agents, focusing on domain-specific foundation models and agent frameworks. A detailed review of evaluation protocols and benchmarks shows how OS Agents are assessed across diverse tasks. Finally, we discuss current challenges and identify promising directions for future research, including safety and privacy, personalization, and self-evolution. This survey aims to consolidate the state of OS Agents research, providing insights to guide both academic inquiry and industrial development. We maintain an open-source GitHub repository as a dynamic resource to foster further innovation in this field. We also present a 9-page version of our work, accepted by ACL 2025, to provide a concise overview of the domain.
