Large Language Model-Brained GUI Agents: A Survey

Abstract

Graphical user interfaces (GUIs) have long been central to human-computer interaction, providing an intuitive, visually driven way to access and interact with digital systems. The advent of large language models (LLMs), particularly multimodal models, has ushered in a new era of GUI automation. These models have demonstrated exceptional capabilities in natural language understanding, code generation, and visual processing. This has paved the way for a new generation of LLM-brained GUI agents capable of interpreting complex GUI elements and autonomously executing actions based on natural language instructions. These agents represent a paradigm shift, enabling users to perform intricate, multi-step tasks through simple conversational commands. Their applications span web navigation, mobile app interaction, and desktop automation, offering a transformative user experience that changes how individuals interact with software. This emerging field is advancing rapidly, with significant progress in both research and industry. To provide a structured understanding of this trend, this paper presents a comprehensive survey of LLM-brained GUI agents, exploring their historical evolution, core components, and advanced techniques. We address research questions concerning existing GUI agent frameworks, the collection and utilization of data for training specialized GUI agents, the development of large action models tailored for GUI tasks, and the evaluation metrics and benchmarks necessary to assess their effectiveness. Additionally, we examine emerging applications powered by these agents. Through a detailed analysis, this survey identifies key research gaps and outlines a roadmap for future advancements in the field. By consolidating foundational knowledge and state-of-the-art developments, this work aims to guide both researchers and practitioners in overcoming challenges and unlocking the full potential of LLM-brained GUI agents.
