GUI-CIDER: Mid-Training GUI Agents via Causal Internalization and Density-aware Exemplar Reselection

Abstract

Despite the rapid progress of multimodal large language models in building Graphical User Interface (GUI) agents, their real-world task completion is fundamentally bottlenecked by a lack of world knowledge about GUI operations. Existing solutions typically rely on expensive multi-agent scaffolding or conventional post-training paradigms, such as Supervised Fine-Tuning (SFT) and Reinforcement Learning (RL). However, post-training only allows agents to absorb world knowledge implicitly through action annotations or reward signals, leading to inefficient trajectory memorization rather than genuine comprehension. An approach that enables explicit learning of this knowledge is therefore needed. To this end, we propose GUI-CIDER, a mid-training method that explicitly internalizes GUI world knowledge through Causal Internalization and Density-aware Exemplar Reselection. GUI-CIDER operates in three stages: (1) data synthesis, which distills static planning knowledge and dynamic causal knowledge from GUI trajectories into text; (2) exemplar reselection, which filters the corpus by rewarding causal structures and penalizing semantic redundancy; and (3) mid-training, which uses the refined data to embed the acquired knowledge. Extensive experiments on two GUI knowledge benchmarks and three task completion benchmarks demonstrate that GUI-CIDER consistently improves both the agent's understanding of GUI operations and its task success rate. The code is available at https://github.com/Wuzheng02/GUI-CIDER.
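
To make stage (2) more concrete, the following is a minimal illustrative sketch of a filter that rewards causal structure and penalizes semantic redundancy. It is not the scoring used by GUI-CIDER, which the abstract does not specify. The cue list `CAUSAL_CUES`, the Jaccard token overlap used as a stand-in for semantic similarity, the greedy selection loop, and the `redundancy_weight` parameter are all assumptions made for illustration.

```python
import re

# Hypothetical causal cue phrases (an assumption for illustration only).
CAUSAL_CUES = ("because", "therefore", "causes", "leads to", "results in")


def tokens(text):
    """Lowercased alphanumeric token set of a text."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def causal_score(text):
    """Count occurrences of causal cue phrases (a crude proxy for causal structure)."""
    lowered = text.lower()
    return sum(lowered.count(cue) for cue in CAUSAL_CUES)


def jaccard(a, b):
    """Jaccard overlap of two token sets, used here as a stand-in for semantic similarity."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def reselect(corpus, k, redundancy_weight=2.0):
    """Greedily pick up to k texts, trading causal score against overlap with earlier picks."""
    token_sets = [tokens(t) for t in corpus]
    selected = []
    remaining = list(range(len(corpus)))

    def gain(i):
        # Redundancy is the highest overlap with any already selected text.
        redundancy = max(
            (jaccard(token_sets[i], token_sets[j]) for j in selected),
            default=0.0,
        )
        return causal_score(corpus[i]) - redundancy_weight * redundancy

    while remaining and len(selected) < k:
        best = max(remaining, key=gain)
        selected.append(best)
        remaining.remove(best)
    return [corpus[i] for i in selected]
```

Under these assumptions, a near-duplicate of an already selected causal statement loses out to a less similar text with a lower causal score, which is the trade-off that stage (2) describes. A practical version would likely replace the token overlap with embedding-based similarity.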