LLM-Powered GUI Agents in Phone Automation: Surveying Progress and Prospects

Abstract

With the rapid rise of large language models (LLMs), phone automation has undergone transformative changes. This paper systematically reviews LLM-driven phone GUI agents, highlighting their evolution from script-based automation to intelligent, adaptive systems. We first contextualize three key challenges, namely (i) limited generality, (ii) high maintenance overhead, and (iii) weak intent comprehension, and show how LLMs address them through advanced language understanding, multimodal perception, and robust decision-making. We then propose a taxonomy covering fundamental agent frameworks (single-agent, multi-agent, plan-then-act), modeling approaches (prompt engineering, training-based), and essential datasets and benchmarks. Furthermore, we detail task-specific architectures, supervised fine-tuning, and reinforcement learning strategies that bridge user intent and GUI operations. Finally, we discuss open challenges such as dataset diversity, on-device deployment efficiency, user-centric adaptation, and security concerns, offering forward-looking insights into this rapidly evolving field. By providing a structured overview and identifying pressing research gaps, this paper serves as a definitive reference for researchers and practitioners seeking to harness LLMs in designing scalable, user-friendly phone GUI agents.
