OS Agents: A Survey on MLLM-based Agents for General Computing Devices Use
August 6, 2025
Authors: Xueyu Hu, Tao Xiong, Biao Yi, Zishu Wei, Ruixuan Xiao, Yurun Chen, Jiasheng Ye, Meiling Tao, Xiangxin Zhou, Ziyu Zhao, Yuhuai Li, Shengze Xu, Shenzhi Wang, Xinchen Xu, Shuofei Qiao, Zhaokai Wang, Kun Kuang, Tieyong Zeng, Liang Wang, Jiwei Li, Yuchen Eleanor Jiang, Wangchunshu Zhou, Guoyin Wang, Keting Yin, Zhou Zhao, Hongxia Yang, Fan Wu, Shengyu Zhang, Fei Wu
cs.AI
Abstract
The dream of creating AI assistants as capable and versatile as the fictional
J.A.R.V.I.S from Iron Man has long captivated imaginations. With the evolution
of (multi-modal) large language models ((M)LLMs), this dream is closer to
reality: (M)LLM-based agents that use computing devices (e.g., computers and
mobile phones) to automate tasks, operating within the environments and
interfaces (e.g., graphical user interfaces (GUIs)) provided by operating
systems (OS), have advanced significantly. This paper presents a comprehensive
survey of these agents, designated as OS Agents. We begin by elucidating the
fundamentals of OS Agents, exploring their key components including the
environment, observation space, and action space, and outlining essential
capabilities such as understanding, planning, and grounding. We then examine
methodologies for constructing OS Agents, focusing on domain-specific
foundation models and agent frameworks. A detailed review of evaluation
protocols and benchmarks highlights how OS Agents are assessed across diverse
tasks. Finally, we discuss current challenges and identify promising directions
for future research, including safety and privacy, personalization and
self-evolution. This survey aims to consolidate the state of OS Agents
research, providing insights to guide both academic inquiry and industrial
development. An open-source GitHub repository is maintained as a dynamic
resource to foster further innovation in this field. A 9-page version of this
work, accepted by ACL 2025, provides a concise overview of the domain.