
D-Artemis: A Deliberative Cognitive Framework for Mobile GUI Multi-Agents

September 26, 2025
Authors: Hongze Mi, Yibo Feng, Wenjie Lu, Yuqi Wang, Jinyuan Li, Song Cao, He Cui, Tengfei Tian, Xuelin Zhang, Haotian Luo, Di Sun, Naiqiang Tan, Gang Pan
cs.AI

Abstract

Graphical User Interface (GUI) agents aim to automate a wide spectrum of human tasks by emulating user interaction. Despite rapid advances, current approaches are hindered by three critical challenges: a data bottleneck in end-to-end training, the high cost of delayed error detection, and the risk of contradictory guidance. Inspired by the human cognitive loop of Thinking, Alignment, and Reflection, this paper presents D-Artemis, a novel deliberative framework. D-Artemis uses a fine-grained, app-specific tip retrieval mechanism to inform its decision-making. It also employs a proactive Pre-execution Alignment stage, in which a Thought-Action Consistency (TAC) Check module and an Action Correction Agent (ACA) work in concert to reduce the risk of execution failures. A post-execution Status Reflection Agent (SRA) completes the cognitive loop, enabling strategic learning from experience. Crucially, D-Artemis enhances the capabilities of general-purpose multimodal large language models (MLLMs) on GUI tasks without training on complex trajectory datasets, demonstrating strong generalization. D-Artemis establishes new state-of-the-art (SOTA) results on both major benchmarks, achieving a 75.8% success rate on AndroidWorld and 96.8% on ScreenSpot-V2. Extensive ablation studies further demonstrate the significant contribution of each component to the framework.
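To make the loop described in the abstract concrete, here is a minimal Python sketch of one possible control flow: retrieve app-specific tips, propose a thought and action, check thought-action consistency before executing, correct the action if the check fails, then reflect on the outcome. This is not the authors' implementation. Every name in it (`Step`, `Episode`, `run_episode`, and the `toy_*` stand-ins) is hypothetical, and the string-equality consistency check is a placeholder assumption for whatever the TAC Check module actually does.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Tuple


@dataclass
class Step:
    thought: str  # what the agent intends to do
    action: str   # the concrete action it proposes to execute


@dataclass
class Episode:
    actions: List[str] = field(default_factory=list)      # actions actually executed
    reflections: List[str] = field(default_factory=list)  # notes from the reflection stage


def run_episode(
    task: str,
    app: str,
    tip_store: Dict[str, List[str]],
    propose: Callable[[str, List[str], Episode], Step],
    is_consistent: Callable[[Step], bool],
    correct: Callable[[Step], Step],
    execute: Callable[[str], str],
    reflect: Callable[[Step, str], Tuple[str, bool]],
    max_steps: int = 5,
) -> Episode:
    """Hypothetical think / align / reflect loop; all callables are injected."""
    episode = Episode()
    tips = tip_store.get(app, [])  # app-specific tip retrieval (assumed: a dict lookup)
    for _ in range(max_steps):
        step = propose(task, tips, episode)      # thinking
        if not is_consistent(step):              # pre-execution consistency check
            step = correct(step)                 # action correction before execution
        outcome = execute(step.action)
        episode.actions.append(step.action)
        note, done = reflect(step, outcome)      # post-execution reflection
        episode.reflections.append(note)
        if done:
            break
    return episode


# Toy stand-ins so the sketch runs without any model or device.
def toy_propose(task: str, tips: List[str], episode: Episode) -> Step:
    # Deliberately inconsistent: the thought says Save, the action says Cancel.
    return Step(thought="tap Save", action="tap Cancel")


def toy_is_consistent(step: Step) -> bool:
    return step.action == step.thought


def toy_correct(step: Step) -> Step:
    # Replace the action with the one the thought describes.
    return Step(thought=step.thought, action=step.thought)


def toy_execute(action: str) -> str:
    return f"executed: {action}"


def toy_reflect(step: Step, outcome: str) -> Tuple[str, bool]:
    done = outcome == "executed: tap Save"
    return (f"{step.action} -> {outcome}", done)
```

Calling `run_episode("save the note", "notes", {"notes": ["Save is in the top bar"]}, toy_propose, toy_is_consistent, toy_correct, toy_execute, toy_reflect)` runs one step in which the mismatched action is corrected before it is executed. Passing the stages in as callables keeps the sketch independent of any particular model or device backend.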