GUI-CIDER: Mid-Training GUI Agents via Causal Internalization and Density-aware Exemplar Reselection

Abstract

Despite the rapid progress of multimodal large language models in building Graphical User Interface (GUI) agents, their real-world task completion is fundamentally bottlenecked by a lack of world knowledge about GUI operations. Existing solutions typically rely on expensive multi-agent scaffolding or conventional post-training paradigms, such as Supervised Fine-Tuning (SFT) and Reinforcement Learning (RL). However, post-training only allows agents to absorb world knowledge implicitly, through action annotations or reward signals, which leads to inefficient trajectory memorization rather than genuine comprehension. An approach that enables explicit learning of this knowledge is therefore needed. To this end, we propose GUI-CIDER, a mid-training method that explicitly internalizes GUI world knowledge through Causal Internalization and Density-aware Exemplar Reselection. GUI-CIDER operates in three stages: (1) data synthesis, which distills static planning knowledge and dynamic causal knowledge from GUI trajectories into text; (2) exemplar reselection, which filters the corpus by rewarding causal structures and penalizing semantic redundancy; and (3) mid-training, where the refined data is used to embed the acquired knowledge. Extensive experiments on two GUI knowledge benchmarks and three task completion benchmarks demonstrate that GUI-CIDER consistently improves both the agent's understanding of GUI operations and its task success rates. The code is available at https://github.com/Wuzheng02/GUI-CIDER.
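
As a rough illustration of the exemplar-reselection idea (reward causal structure, penalize semantic redundancy), the sketch below greedily picks texts by a score that favors causal cue words and discounts overlap with texts already selected. The cue-word list, the Jaccard overlap measure, the weight `lam`, and every function name here are assumptions made for illustration; the abstract does not specify how GUI-CIDER actually scores causal structure or redundancy.

```python
import re

# Hypothetical cue words standing in for a real causal-structure detector.
CAUSAL_CUES = {"because", "causes", "after", "then", "leads", "results", "so"}


def tokens(text):
    """Return the set of lowercase words in a text."""
    return set(re.findall(r"[a-z']+", text.lower()))


def causal_score(text):
    """Fraction of distinct tokens that are causal cue words (assumed proxy)."""
    toks = tokens(text)
    return len(toks & CAUSAL_CUES) / len(toks) if toks else 0.0


def jaccard(a, b):
    """Jaccard overlap of two token sets, a crude redundancy measure."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def reselect(corpus, k, lam=0.5):
    """Greedily pick up to k texts, trading causal score against redundancy."""
    selected = []
    remaining = list(corpus)
    while remaining and len(selected) < k:

        def gain(text):
            # Redundancy is the highest overlap with anything already chosen.
            redundancy = max(
                (jaccard(tokens(text), tokens(s)) for s in selected),
                default=0.0,
            )
            return causal_score(text) - lam * redundancy

        best = max(remaining, key=gain)
        selected.append(best)
        remaining.remove(best)
    return selected
```

With a larger `lam`, near-duplicate texts are pushed out of the selection even when they contain causal cues; with `lam=0` the procedure reduces to ranking by causal score alone.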