OS Agents: A Survey on MLLM-based Agents for General Computing Devices Use

Abstract

The dream of creating AI assistants as capable and versatile as the fictional J.A.R.V.I.S from Iron Man has long captivated imaginations. With the evolution of (multi-modal) large language models ((M)LLMs), this dream is closer to reality: (M)LLM-based agents that use computing devices (e.g., computers and mobile phones) by operating within the environments and interfaces (e.g., Graphical User Interface (GUI)) provided by operating systems (OS) to automate tasks have advanced significantly. This paper presents a comprehensive survey of these advanced agents, designated as OS Agents. We begin by elucidating the fundamentals of OS Agents, exploring their key components, including the environment, observation space, and action space, and outlining essential capabilities such as understanding, planning, and grounding. We then examine methodologies for constructing OS Agents, focusing on domain-specific foundation models and agent frameworks. A detailed review of evaluation protocols and benchmarks highlights how OS Agents are assessed across diverse tasks. Finally, we discuss current challenges and identify promising directions for future research, including safety and privacy, personalization, and self-evolution. This survey aims to consolidate the state of OS Agents research, providing insights to guide both academic inquiry and industrial development. An open-source GitHub repository is maintained as a dynamic resource to foster further innovation in this field. We present a 9-page version of our work, accepted at ACL 2025, to provide a concise overview of the domain.
