D-Artemis: A Deliberative Cognitive Framework for Mobile GUI Multi-Agents

Abstract

Graphical User Interface (GUI) agents aim to automate a wide spectrum of human tasks by emulating user interaction. Despite rapid advances, current approaches are hindered by several critical challenges: a data bottleneck in end-to-end training, the high cost of delayed error detection, and the risk of contradictory guidance. Inspired by the human cognitive loop of Thinking, Alignment, and Reflection, we present D-Artemis, a novel deliberative framework. D-Artemis leverages a fine-grained, app-specific tip retrieval mechanism to inform its decision-making process. It also employs a proactive Pre-execution Alignment stage, in which a Thought-Action Consistency (TAC) Check module and an Action Correction Agent (ACA) work in concert to mitigate the risk of execution failures. A post-execution Status Reflection Agent (SRA) completes the cognitive loop, enabling strategic learning from experience. Crucially, D-Artemis enhances the capabilities of general-purpose multimodal large language models (MLLMs) on GUI tasks without requiring training on complex trajectory datasets, demonstrating strong generalization. D-Artemis establishes new state-of-the-art (SOTA) results on both major benchmarks, achieving a 75.8% success rate on AndroidWorld and 96.8% on ScreenSpot-V2. Extensive ablation studies further demonstrate the significant contribution of each component to the framework.
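To make the Thinking, Alignment, and Reflection loop concrete, the following is a minimal illustrative sketch of the control flow the abstract describes. It is not the authors' implementation: every name here (`Step`, `LoopState`, `run_episode`, and the callables `propose`, `is_consistent`, `correct`, `execute`, `reflect`) is a hypothetical placeholder, and the tip store is assumed to be a plain dictionary keyed by app name.

```python
from __future__ import annotations

from dataclasses import dataclass, field
from typing import Callable


@dataclass
class Step:
    thought: str  # the agent's stated intent for this step
    action: str   # the concrete GUI action it proposes


@dataclass
class LoopState:
    history: list[str] = field(default_factory=list)      # executed actions
    reflections: list[str] = field(default_factory=list)  # post-execution notes


def run_episode(
    task: str,
    app: str,
    tips_by_app: dict[str, list[str]],
    propose: Callable[[str, list[str], LoopState], Step],
    is_consistent: Callable[[Step], bool],
    correct: Callable[[Step], Step],
    execute: Callable[[str], str],
    reflect: Callable[[Step, str], tuple[str, bool]],
    max_steps: int = 10,
) -> LoopState:
    """Hypothetical sketch of a think / align / reflect loop."""
    state = LoopState()
    # App-specific tip retrieval (assumed here to be a dictionary lookup).
    tips = tips_by_app.get(app, [])
    for _ in range(max_steps):
        # Thinking: propose a thought and an action, informed by the tips.
        step = propose(task, tips, state)
        # Pre-execution alignment: a TAC-style check, then an ACA-style fix.
        if not is_consistent(step):
            step = correct(step)
        outcome = execute(step.action)
        state.history.append(step.action)
        # Post-execution reflection: an SRA-style note plus a completion flag.
        note, done = reflect(step, outcome)
        state.reflections.append(note)
        if done:
            break
    return state
```

In this sketch the consistency check runs before any action reaches the device, which mirrors the abstract's point that catching a thought-action mismatch early is cheaper than detecting the error after execution.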
