LearnAct: Few-Shot Mobile GUI Agent with a Unified Demonstration Benchmark
April 18, 2025
Authors: Guangyi Liu, Pengxiang Zhao, Liang Liu, Zhiming Chen, Yuxiang Chai, Shuai Ren, Hao Wang, Shibo He, Wenchao Meng
cs.AI
Abstract
Mobile GUI agents show promise in automating tasks but face generalization
challenges in diverse real-world scenarios. Traditional approaches using
pre-training or fine-tuning with massive datasets struggle with the diversity
of mobile applications and user-specific tasks. We propose enhancing mobile GUI
agent capabilities through human demonstrations, focusing on improving
performance in unseen scenarios rather than pursuing universal generalization
through larger datasets. To realize this paradigm, we introduce LearnGUI, the
first comprehensive dataset specifically designed for studying
demonstration-based learning in mobile GUI agents, comprising 2,252 offline
tasks and 101 online tasks with high-quality human demonstrations. We further
develop LearnAct, a sophisticated multi-agent framework that automatically
extracts knowledge from demonstrations to enhance task completion. This
framework integrates three specialized agents: DemoParser for knowledge
extraction, KnowSeeker for relevant knowledge retrieval, and ActExecutor for
demonstration-enhanced task execution. Our experimental results show
significant performance gains in both offline and online evaluations. In
offline evaluation, a single demonstration raises Gemini-1.5-Pro's accuracy
from 19.3% to 51.7%. In online evaluation, our framework raises
UI-TARS-7B-SFT's task success rate from
18.1% to 32.8%. The LearnAct framework and LearnGUI benchmark establish
demonstration-based learning as a promising direction for more adaptable,
personalized, and deployable mobile GUI agents.
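
The three-agent flow described in the abstract (extract knowledge from demonstrations, retrieve what is relevant, execute with that knowledge in context) can be illustrated with a minimal sketch. Everything below is hypothetical and not taken from the paper: the `Demonstration` and `Knowledge` types, the function names, and in particular the retrieval step, which uses plain Jaccard word overlap as a stand-in for whatever retrieval KnowSeeker actually performs. The final step only builds a prompt string; a real executor would pass it, along with the current screen, to a GUI model.

```python
from dataclasses import dataclass


@dataclass
class Demonstration:
    instruction: str      # task the human demonstrated
    actions: list[str]    # recorded steps, e.g. "tap Alarm"


@dataclass
class Knowledge:
    instruction: str      # task the knowledge was derived from
    summary: str          # condensed, reusable description of the steps


def parse_demo(demo: Demonstration) -> Knowledge:
    """DemoParser stand-in (assumption): condense a demo into numbered steps."""
    steps = "; ".join(f"{i}. {a}" for i, a in enumerate(demo.actions, start=1))
    return Knowledge(demo.instruction, steps)


def _tokens(text: str) -> set[str]:
    return set(text.lower().split())


def seek_knowledge(task: str, bank: list[Knowledge], k: int = 1) -> list[Knowledge]:
    """KnowSeeker stand-in (assumption): rank entries by Jaccard word overlap."""
    def score(entry: Knowledge) -> float:
        a, b = _tokens(task), _tokens(entry.instruction)
        union = a | b
        return len(a & b) / len(union) if union else 0.0

    return sorted(bank, key=score, reverse=True)[:k]


def build_prompt(task: str, retrieved: list[Knowledge]) -> str:
    """ActExecutor stand-in (assumption): put retrieved knowledge before the task query."""
    lines = [f"Task: {task}"]
    for entry in retrieved:
        lines.append(f"Similar demonstration ({entry.instruction}): {entry.summary}")
    lines.append("Next action:")
    return "\n".join(lines)


# Hypothetical demonstrations, invented for illustration only.
demos = [
    Demonstration("Set an alarm for 7 am",
                  ["open Clock", "tap Alarm", "tap +", "set 7:00", "tap Save"]),
    Demonstration("Send an email to Bob",
                  ["open Gmail", "tap Compose", "type recipient", "tap Send"]),
]

bank = [parse_demo(d) for d in demos]
task = "Set an alarm for 6 am"
prompt = build_prompt(task, seek_knowledge(task, bank, k=1))
print(prompt)
```

With `k=1` the sketch mirrors the single-demonstration setting mentioned in the abstract: the unseen task is paired with the one stored demonstration that most resembles it.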