GUI-CIDER: Mid-Training GUI Agents via Causal Internalization and Density-aware Exemplar Reselection

Abstract

Despite the rapid progress of multimodal large language models in building Graphical User Interface (GUI) agents, their real-world task completion is fundamentally bottlenecked by a lack of world knowledge about GUI operations. Existing solutions typically rely on expensive multi-agent scaffolding or on conventional post-training paradigms such as Supervised Fine-Tuning (SFT) and Reinforcement Learning (RL). However, post-training only lets agents absorb world knowledge implicitly, through action annotations or reward signals, which leads to inefficient trajectory memorization rather than genuine comprehension. An approach that enables explicit learning of this knowledge is therefore needed. To this end, we propose GUI-CIDER, a mid-training method that explicitly internalizes GUI world knowledge through Causal Internalization and Density-aware Exemplar Reselection. GUI-CIDER operates in three stages: (1) data synthesis, which distills static planning knowledge and dynamic causal knowledge from GUI trajectories into text; (2) exemplar reselection, which filters the corpus by rewarding causal structures and penalizing semantic redundancy; and (3) mid-training, which uses the refined data to embed the acquired knowledge in the model. Extensive experiments on two GUI knowledge benchmarks and three task completion benchmarks demonstrate that GUI-CIDER consistently improves both the agent's understanding of GUI operations and its task success rate. The code is available at https://github.com/Wuzheng02/GUI-CIDER.
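To make stage (2) more concrete, the following is a minimal sketch of one way a filter could reward causal structure while penalizing semantic redundancy. It is not the GUI-CIDER implementation: the cue-phrase list, the Jaccard token overlap used as a stand-in for semantic similarity, the greedy selection loop, and all names (`CAUSAL_CUES`, `causal_reward`, `reselect`, `redundancy_weight`) are assumptions made purely for illustration.

```python
import re

# Hypothetical cue phrases standing in for a real causal-structure detector.
CAUSAL_CUES = ("because", "leads to", "results in", "so that")


def tokens(text):
    """Lowercased set of alphanumeric tokens in the text."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def causal_reward(text):
    """Number of distinct cue phrases present (an illustrative proxy only)."""
    lowered = text.lower()
    return sum(1 for cue in CAUSAL_CUES if cue in lowered)


def jaccard(a, b):
    """Token-set overlap, used here as a crude proxy for semantic similarity."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def reselect(corpus, k, redundancy_weight=2.0):
    """Greedily keep up to k texts, scoring each candidate as
    causal reward minus a penalty for overlap with texts already kept."""
    token_sets = [tokens(t) for t in corpus]
    selected = []
    remaining = list(range(len(corpus)))

    def score(i):
        # Penalty is the highest similarity to anything already selected.
        penalty = max(
            (jaccard(token_sets[i], token_sets[j]) for j in selected),
            default=0.0,
        )
        return causal_reward(corpus[i]) - redundancy_weight * penalty

    while remaining and len(selected) < k:
        best = max(remaining, key=score)
        selected.append(best)
        remaining.remove(best)
    return [corpus[i] for i in selected]


if __name__ == "__main__":
    corpus = [
        "Tapping Save closes the dialog because the form is submitted.",
        "Tapping Save closes the dialog because the form is submitted now.",
        "The settings page has a blue header.",
        "Swiping left leads to the archive view.",
    ]
    for text in reselect(corpus, k=2):
        print(text)
```

In this toy corpus the second sentence is a near duplicate of the first, so its redundancy penalty outweighs its causal reward and the sketch keeps the first and fourth sentences instead. A real system would presumably replace the token overlap with embedding-based similarity and the cue list with a learned or structured causal scorer.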