

GUI-Libra: Training Native GUI Agents to Reason and Act with Action-aware Supervision and Partially Verifiable RL

February 25, 2026
Authors: Rui Yang, Qianhui Wu, Zhaoyang Wang, Hanyang Chen, Ke Yang, Hao Cheng, Huaxiu Yao, Baoling Peng, Huan Zhang, Jianfeng Gao, Tong Zhang
cs.AI

Abstract

Open-source native GUI agents still lag behind closed-source systems on long-horizon navigation tasks. This gap stems from two limitations: a shortage of high-quality, action-aligned reasoning data, and the direct adoption of generic post-training pipelines that overlook the unique challenges of GUI agents. We identify two fundamental issues in these pipelines: (i) standard SFT with CoT reasoning often hurts grounding, and (ii) step-wise RLVR-style training faces partial verifiability, where multiple actions can be correct but only a single demonstrated action is used for verification. This makes offline step-wise metrics weak predictors of online task success. In this work, we present GUI-Libra, a tailored training recipe that addresses these challenges. First, to mitigate the scarcity of action-aligned reasoning data, we introduce a data construction and filtering pipeline and release a curated 81K GUI reasoning dataset. Second, to reconcile reasoning with grounding, we propose action-aware SFT that mixes reasoning-then-action and direct-action data and reweights tokens to emphasize action and grounding. Third, to stabilize RL under partial verifiability, we identify the overlooked importance of KL regularization in RLVR and show that a KL trust region is critical for improving offline-to-online predictability; we further introduce success-adaptive scaling to downweight unreliable negative gradients. Across diverse web and mobile benchmarks, GUI-Libra consistently improves both step-wise accuracy and end-to-end task completion. Our results suggest that carefully designed post-training and data curation can unlock significantly stronger task-solving capabilities without costly online data collection. We release our dataset, code, and models to facilitate further research on data-efficient post-training for reasoning-capable GUI agents.
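The abstract's second contribution, action-aware SFT, reweights tokens so the loss emphasizes the action and grounding portion of a reasoning-then-action trace. The sketch below shows one minimal way such a reweighted objective could look; the function name, the binary action-token mask, and the weight value are illustrative assumptions, not details taken from the paper.

```python
def action_aware_sft_loss(token_nlls, is_action_token, action_weight=2.0):
    """Weighted-average per-token loss that upweights action/grounding tokens.

    token_nlls      : per-token negative log-likelihoods from the model
    is_action_token : flags marking tokens of the action string (e.g. the
                      click target or coordinates) vs. the reasoning prefix
    action_weight   : illustrative multiplier for action tokens (assumed)
    """
    weights = [action_weight if flag else 1.0 for flag in is_action_token]
    total = sum(w * nll for w, nll in zip(weights, token_nlls))
    # Normalize by the weight mass so the loss scale stays comparable
    # across sequences with different action/reasoning ratios.
    return total / sum(weights)


# Toy example: one reasoning token (nll=1.0) and one action token (nll=2.0).
loss = action_aware_sft_loss([1.0, 2.0], [False, True], action_weight=2.0)
```

With equal weights the mean NLL would be 1.5; doubling the action token's weight pulls the loss toward the action term, so gradient updates focus more on grounding accuracy.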
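The third contribution combines a KL trust region with success-adaptive scaling that downweights unreliable negative gradients under partial verifiability. A minimal sketch of how these two pieces could compose is shown below; the function names, the negative-scale floor, and the KL coefficient are assumed for illustration and do not reproduce the paper's exact formulation.

```python
def scaled_advantage(reward, baseline, task_success_rate, neg_scale_floor=0.1):
    """Success-adaptive scaling (sketch): shrink negative advantages when the
    verifier is unreliable.

    Under partial verifiability, a step judged "wrong" against the single
    demonstrated action may actually be valid, so negative signals on
    low-success tasks are discounted. The floor value is an assumption.
    """
    adv = reward - baseline
    if adv < 0:
        adv *= max(neg_scale_floor, task_success_rate)
    return adv


def kl_regularized_objective(logp_ratio, advantage, kl_to_reference, beta=0.05):
    """Policy-gradient surrogate with an explicit KL penalty to the reference
    policy, acting as the trust region the abstract argues is critical."""
    return logp_ratio * advantage - beta * kl_to_reference
```

For example, a negative advantage of -1.0 on a task the policy currently solves 50% of the time would be scaled to -0.5, while positive advantages pass through unchanged.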