D-Artemis: A Deliberative Cognitive Framework for Mobile GUI Multi-Agents

Abstract

Graphical User Interface (GUI) agents aim to automate a wide spectrum of human tasks by emulating user interaction. Despite rapid progress, current approaches are hindered by three critical challenges: the data bottleneck in end-to-end training, the high cost of delayed error detection, and the risk of contradictory guidance. Inspired by the human cognitive loop of Thinking, Alignment, and Reflection, we present D-Artemis, a novel deliberative framework. D-Artemis uses a fine-grained, app-specific tip retrieval mechanism to inform its decision-making process. It also employs a proactive Pre-execution Alignment stage, in which a Thought-Action Consistency (TAC) Check module and an Action Correction Agent (ACA) work in concert to mitigate the risk of execution failures. A post-execution Status Reflection Agent (SRA) completes the cognitive loop, enabling strategic learning from experience. Crucially, D-Artemis enhances the capabilities of general-purpose multimodal large language models (MLLMs) on GUI tasks without training on complex trajectory datasets, and it demonstrates strong generalization. D-Artemis establishes new state-of-the-art (SOTA) results on two major benchmarks, achieving a 75.8% success rate on AndroidWorld and 96.8% on ScreenSpot-V2. Extensive ablation studies further demonstrate the significant contribution of each component to the framework.
