
GUI-Libra: Training Native GUI Agents to Reason and Act with Action-aware Supervision and Partially Verifiable RL

February 25, 2026
作者: Rui Yang, Qianhui Wu, Zhaoyang Wang, Hanyang Chen, Ke Yang, Hao Cheng, Huaxiu Yao, Baoling Peng, Huan Zhang, Jianfeng Gao, Tong Zhang
cs.AI

Abstract

Open-source native GUI agents still lag behind closed-source systems on long-horizon navigation tasks. This gap stems from two limitations: a shortage of high-quality, action-aligned reasoning data, and the direct adoption of generic post-training pipelines that overlook the unique challenges of GUI agents. We identify two fundamental issues in these pipelines: (i) standard SFT with CoT reasoning often hurts grounding, and (ii) step-wise RLVR-style training faces partial verifiability, where multiple actions can be correct but only a single demonstrated action is used for verification. This makes offline step-wise metrics weak predictors of online task success. In this work, we present GUI-Libra, a tailored training recipe that addresses these challenges. First, to mitigate the scarcity of action-aligned reasoning data, we introduce a data construction and filtering pipeline and release a curated 81K GUI reasoning dataset. Second, to reconcile reasoning with grounding, we propose action-aware SFT that mixes reasoning-then-action and direct-action data and reweights tokens to emphasize action and grounding. Third, to stabilize RL under partial verifiability, we identify the overlooked importance of KL regularization in RLVR and show that a KL trust region is critical for improving offline-to-online predictability; we further introduce success-adaptive scaling to downweight unreliable negative gradients. Across diverse web and mobile benchmarks, GUI-Libra consistently improves both step-wise accuracy and end-to-end task completion. Our results suggest that carefully designed post-training and data curation can unlock significantly stronger task-solving capabilities without costly online data collection. We release our dataset, code, and models to facilitate further research on data-efficient post-training for reasoning-capable GUI agents.
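The two training ideas named in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the choice of a fixed per-token weight for action/grounding tokens, and the use of the rollout success rate as both a baseline and a scaling factor for negative advantages are all illustrative assumptions.

```python
def action_aware_sft_loss(token_logprobs, is_action_token, action_weight=2.0):
    """Weighted negative log-likelihood for action-aware SFT (a sketch).

    Tokens belonging to the action/grounding portion of the target get a
    larger weight than reasoning tokens, so the model is pushed not to
    trade grounding accuracy for chain-of-thought fluency. The weight
    value 2.0 is an illustrative assumption, not the paper's setting.
    """
    weights = [action_weight if a else 1.0 for a in is_action_token]
    total = sum(-lp * w for lp, w in zip(token_logprobs, weights))
    return total / sum(weights)


def shaped_advantage(reward, success_rate, logp_new, logp_ref, kl_coef=0.1):
    """Success-adaptive scaling with a KL penalty (a sketch).

    Under partial verifiability, a reward of 0 may just mean the sampled
    action differed from the single demonstrated action, so negative
    advantages are unreliable. Here we downweight them by the task's
    empirical success rate and subtract a per-token KL estimate toward
    the reference policy; both choices are illustrative assumptions.
    """
    adv = reward - success_rate          # baseline-subtracted advantage
    if adv < 0:
        adv *= success_rate              # shrink unreliable negative gradients
    kl = logp_new - logp_ref             # crude per-token KL estimate
    return adv - kl_coef * kl
```

For example, with two target tokens where only the second is an action token, `action_aware_sft_loss([-1.0, -2.0], [False, True])` weights the action token's loss twice as heavily; and `shaped_advantage(0.0, 0.8, -1.0, -1.2)` shrinks the negative advantage before applying the KL penalty.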