OS Agents: A Survey on MLLM-based Agents for General Computing Devices Use

Abstract

The dream of creating AI assistants as capable and versatile as the fictional J.A.R.V.I.S from Iron Man has long captivated imaginations. With the evolution of (multi-modal) large language models ((M)LLMs), this dream is closer to reality: (M)LLM-based agents that use computing devices (e.g., computers and mobile phones) by operating within the environments and interfaces (e.g., the Graphical User Interface (GUI)) provided by operating systems (OS) to automate tasks have advanced significantly. This paper presents a comprehensive survey of these advanced agents, designated as OS Agents. We begin by elucidating the fundamentals of OS Agents, exploring their key components, including the environment, observation space, and action space, and outlining essential capabilities such as understanding, planning, and grounding. We then examine methodologies for constructing OS Agents, focusing on domain-specific foundation models and agent frameworks. A detailed review of evaluation protocols and benchmarks highlights how OS Agents are assessed across diverse tasks. Finally, we discuss current challenges and identify promising directions for future research, including safety and privacy, and personalization and self-evolution. This survey aims to consolidate the state of OS Agents research, providing insights to guide both academic inquiry and industrial development. An open-source GitHub repository is maintained as a dynamic resource to foster further innovation in this field. We present a 9-page version of our work, accepted by ACL 2025, to provide a concise overview of the domain.

