GUI-Libra: Training Native GUI Agents to Reason and Act with Action-aware Supervision and Partially Verifiable RL

Abstract

Open-source native GUI agents still lag behind closed-source systems on long-horizon navigation tasks. This gap stems from two limitations: a shortage of high-quality, action-aligned reasoning data, and the direct adoption of generic post-training pipelines that overlook the unique challenges of GUI agents. We identify two fundamental issues in these pipelines: (i) standard SFT with CoT reasoning often hurts grounding, and (ii) step-wise RLVR-style training faces partial verifiability, where multiple actions can be correct but only a single demonstrated action is used for verification. This makes offline step-wise metrics weak predictors of online task success. In this work, we present GUI-Libra, a tailored training recipe that addresses these challenges. First, to mitigate the scarcity of action-aligned reasoning data, we introduce a data construction and filtering pipeline and release a curated 81K GUI reasoning dataset. Second, to reconcile reasoning with grounding, we propose action-aware SFT that mixes reasoning-then-action and direct-action data and reweights tokens to emphasize action and grounding. Third, to stabilize RL under partial verifiability, we identify the overlooked importance of KL regularization in RLVR and show that a KL trust region is critical for improving offline-to-online predictability; we further introduce success-adaptive scaling to downweight unreliable negative gradients. Across diverse web and mobile benchmarks, GUI-Libra consistently improves both step-wise accuracy and end-to-end task completion. Our results suggest that carefully designed post-training and data curation can unlock significantly stronger task-solving capabilities without costly online data collection. We release our dataset, code, and models to facilitate further research on data-efficient post-training for reasoning-capable GUI agents.
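
As a rough illustration of the token reweighting idea mentioned in the abstract, the sketch below computes a weighted negative log-likelihood in which action and grounding tokens count more than reasoning tokens. It is a minimal sketch under stated assumptions, not the GUI-Libra implementation: the function name `weighted_token_nll`, the three token categories, and the weight values are all hypothetical, since the abstract does not specify them.

```python
import math

# Hypothetical per-category weights (assumption: the abstract gives no values).
TOKEN_WEIGHTS = {"reasoning": 1.0, "action": 2.0, "grounding": 2.0}


def weighted_token_nll(token_log_probs, token_kinds, weights=TOKEN_WEIGHTS):
    """Weighted mean negative log-likelihood over one target sequence.

    token_log_probs: log-probability the model assigns to each target token.
    token_kinds: category label for each token position
                 ("reasoning", "action", or "grounding"; labels are illustrative).
    weights: mapping from category label to its loss weight.
    """
    if len(token_log_probs) != len(token_kinds):
        raise ValueError("token_log_probs and token_kinds must have equal length")

    total, norm = 0.0, 0.0
    for log_prob, kind in zip(token_log_probs, token_kinds):
        w = weights[kind]
        total += -w * log_prob  # weighted NLL contribution of this token
        norm += w               # normalize by total weight, not token count
    return total / norm if norm > 0 else 0.0


# Toy sequence: two confident reasoning tokens, then a less certain
# action token and grounding token.
log_probs = [math.log(0.9), math.log(0.9), math.log(0.5), math.log(0.25)]
kinds = ["reasoning", "reasoning", "action", "grounding"]

uniform = weighted_token_nll(
    log_probs, kinds, {"reasoning": 1.0, "action": 1.0, "grounding": 1.0}
)
reweighted = weighted_token_nll(log_probs, kinds)
```

In this toy case the reweighted loss is larger than the uniform one, because the poorly predicted action and grounding tokens contribute more to the average. With equal weights the function reduces to the ordinary mean token-level loss.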
