D-Artemis: A Deliberative Cognitive Framework for Mobile GUI Multi-Agents

September 26, 2025
Authors: Hongze Mi, Yibo Feng, Wenjie Lu, Yuqi Wang, Jinyuan Li, Song Cao, He Cui, Tengfei Tian, Xuelin Zhang, Haotian Luo, Di Sun, Naiqiang Tan, Gang Pan
cs.AI

Abstract

Graphical User Interface (GUI) agents aim to automate a wide spectrum of human tasks by emulating user interaction. Despite rapid advances, current approaches are hindered by several critical challenges: a data bottleneck in end-to-end training, the high cost of delayed error detection, and the risk of contradictory guidance. Inspired by the human cognitive loop of Thinking, Alignment, and Reflection, we present D-Artemis, a novel deliberative framework. D-Artemis leverages a fine-grained, app-specific tip retrieval mechanism to inform its decision-making process. It also employs a proactive Pre-execution Alignment stage, in which a Thought-Action Consistency (TAC) Check module and an Action Correction Agent (ACA) work in concert to mitigate the risk of execution failures. A post-execution Status Reflection Agent (SRA) completes the cognitive loop, enabling strategic learning from experience. Crucially, D-Artemis enhances the capabilities of general-purpose multimodal large language models (MLLMs) on GUI tasks without training on complex trajectory datasets, demonstrating strong generalization. D-Artemis establishes new state-of-the-art (SOTA) results on both major benchmarks, achieving a 75.8% success rate on AndroidWorld and 96.8% on ScreenSpot-V2. Extensive ablation studies further demonstrate the significant contribution of each component to the framework.
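
The Thinking, Alignment, and Reflection loop described in the abstract can be sketched roughly as follows. This is an illustrative sketch only, not the authors' implementation: every name here (Step, LoopComponents, run_episode, and each callable field) is hypothetical, and the abstract does not specify how the agents exchange data, whether correction runs once or repeatedly, or how tips and reflections are fed into later steps.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class Step:
    """One proposed step: the agent's reasoning and the action it chose."""
    thought: str
    action: str


@dataclass
class LoopComponents:
    """Hypothetical pluggable pieces; all signatures are assumptions."""
    retrieve_tips: Callable[[str], List[str]]                   # app name -> tips
    propose: Callable[[str, str, List[str], List[str]], Step]   # Thinking
    is_consistent: Callable[[Step, str], bool]                  # TAC Check
    correct: Callable[[Step, str], Step]                        # ACA
    execute: Callable[[str], str]                               # action -> new screen
    reflect: Callable[[Step, str, str], str]                    # SRA
    is_done: Callable[[str, str], bool]                         # task, screen -> done?


def run_episode(
    task: str,
    app: str,
    screen: str,
    c: LoopComponents,
    max_steps: int = 10,
) -> Tuple[List[Step], List[str], str]:
    """Run one task; return executed steps, reflection notes, final screen."""
    tips = c.retrieve_tips(app)      # app-specific tip retrieval
    reflections: List[str] = []      # experience carried across steps
    trace: List[Step] = []

    for _ in range(max_steps):
        if c.is_done(task, screen):
            break

        # Thinking: propose a thought and an action from the current state.
        step = c.propose(task, screen, tips, reflections)

        # Pre-execution Alignment: check the action against the thought and
        # correct it before anything is executed (a single pass is assumed).
        if not c.is_consistent(step, screen):
            step = c.correct(step, screen)

        # Execute, then reflect on the observed state change.
        new_screen = c.execute(step.action)
        reflections.append(c.reflect(step, screen, new_screen))

        trace.append(step)
        screen = new_screen

    return trace, reflections, screen
```

In this sketch the consistency check runs before execution, so an inconsistent action is repaired before it can reach the device, while reflection runs after execution and only influences later steps. That ordering follows the abstract's description of a proactive alignment stage and a post-execution reflection agent.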