GUI-CIDER: Mid-training GUI Agents via Causal Internalization and Density-aware Exemplar Reselection
May 27, 2026
Authors: Zheng Wu, Chengcheng Han, Zhengxi Lu, Tianjie Ju, Yanyu Chen, Qi Gu, Xunliang Cai, Zhuosheng Zhang
cs.AI
Abstract
Despite the rapid progress of multimodal large language models in building Graphical User Interface (GUI) agents, their real-world task completion is fundamentally bottlenecked by a lack of world knowledge about GUI operations. Existing solutions typically rely on expensive multi-agent scaffolding or conventional post-training paradigms, such as Supervised Fine-Tuning (SFT) and Reinforcement Learning (RL). However, post-training only allows agents to absorb world knowledge implicitly through action annotations or reward signals, leading to inefficient trajectory memorization rather than genuine comprehension. An approach that enables explicit learning of this knowledge is therefore needed. To this end, we propose GUI-CIDER, a mid-training method that explicitly internalizes GUI world knowledge through Causal Internalization and Density-aware Exemplar Reselection. GUI-CIDER operates in three stages: (1) data synthesis, which distills static planning knowledge and dynamic causal knowledge from GUI trajectories into text; (2) exemplar reselection, which filters the corpus by rewarding causal structures and penalizing semantic redundancy; and (3) mid-training, where the refined data is used to embed the acquired knowledge. Extensive experiments on two GUI knowledge benchmarks and three task completion benchmarks demonstrate that GUI-CIDER consistently improves both the agent's understanding of GUI operations and its task success rates. The code is available at https://github.com/Wuzheng02/GUI-CIDER.
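The abstract describes exemplar reselection only at a high level (reward causal structure, penalize semantic redundancy) and does not give the scoring function. The following is a minimal illustrative sketch of that general idea, not the authors' implementation. Everything in it is an assumption made for illustration: the cue-word list `CAUSAL_CUES`, the use of Jaccard token overlap as a stand-in for semantic redundancy, the greedy selection loop, and the `redundancy_weight` parameter.

```python
import re

# Hypothetical causal cue phrases; the actual criteria are not stated in the abstract.
CAUSAL_CUES = ("because", "therefore", "so that", "leads to", "results in", "causes")


def tokens(text):
    """Lowercased set of alphanumeric tokens in the text."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def causal_score(text):
    """Count occurrences of the (assumed) causal cue phrases."""
    lowered = text.lower()
    return sum(lowered.count(cue) for cue in CAUSAL_CUES)


def jaccard(a, b):
    """Token-set overlap, used here as a crude proxy for semantic redundancy."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def reselect(corpus, k, redundancy_weight=2.0):
    """Greedily pick up to k texts: reward causal cues, penalize overlap with texts already picked."""
    token_sets = [tokens(t) for t in corpus]
    remaining = list(range(len(corpus)))
    chosen = []
    while remaining and len(chosen) < k:
        def gain(i):
            redundancy = max(
                (jaccard(token_sets[i], token_sets[j]) for j in chosen),
                default=0.0,
            )
            return causal_score(corpus[i]) - redundancy_weight * redundancy

        best = max(remaining, key=gain)
        chosen.append(best)
        remaining.remove(best)
    return [corpus[i] for i in chosen]


corpus = [
    "Tapping the Save icon causes the dialog to close because the file is written.",
    "Tapping the Save icon causes the dialog to close because the file is written to disk.",
    "The settings screen lists Wi-Fi, Bluetooth, and Display.",
    "Scrolling down leads to more results being loaded.",
]
selected = reselect(corpus, k=2)
```

In this toy corpus the second sentence is a near duplicate of the first, so the redundancy penalty pushes it below the less causal but novel fourth sentence. A real pipeline would presumably use embedding-based similarity and a richer notion of causal structure; see the linked repository for the actual method.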