GUI-Libra: Training Native GUI Agents to Reason and Act with Action-aware Supervision and Partially Verifiable RL

Abstract

Open-source native GUI agents still lag behind closed-source systems on long-horizon navigation tasks. This gap stems from two limitations: a shortage of high-quality, action-aligned reasoning data, and the direct adoption of generic post-training pipelines that overlook the unique challenges of GUI agents. We identify two fundamental issues in these pipelines: (i) standard SFT with CoT reasoning often hurts grounding, and (ii) step-wise RLVR-style training faces partial verifiability, where multiple actions can be correct but only a single demonstrated action is used for verification. This makes offline step-wise metrics weak predictors of online task success. In this work, we present GUI-Libra, a tailored training recipe that addresses these challenges. First, to mitigate the scarcity of action-aligned reasoning data, we introduce a data construction and filtering pipeline and release a curated 81K GUI reasoning dataset. Second, to reconcile reasoning with grounding, we propose action-aware SFT that mixes reasoning-then-action and direct-action data and reweights tokens to emphasize action and grounding. Third, to stabilize RL under partial verifiability, we identify the overlooked importance of KL regularization in RLVR and show that a KL trust region is critical for improving offline-to-online predictability; we further introduce success-adaptive scaling to downweight unreliable negative gradients. Across diverse web and mobile benchmarks, GUI-Libra consistently improves both step-wise accuracy and end-to-end task completion. Our results suggest that carefully designed post-training and data curation can unlock significantly stronger task-solving capabilities without costly online data collection. We release our dataset, code, and models to facilitate further research on data-efficient post-training for reasoning-capable GUI agents.
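The token reweighting in action-aware SFT can be illustrated with a minimal sketch. The abstract does not specify the weighting scheme, so everything here is an assumption for illustration only: the function name `weighted_token_nll`, the three token categories, and the weight values are hypothetical and are not taken from the released code. The sketch computes a weighted mean negative log-likelihood in which action and grounding tokens count more than reasoning tokens.

```python
import math

# Hypothetical per-category weights; the abstract gives no actual values.
DEFAULT_WEIGHTS = {"reasoning": 1.0, "action": 2.0, "grounding": 4.0}


def weighted_token_nll(token_probs, token_types, weights=DEFAULT_WEIGHTS):
    """Weighted mean negative log-likelihood over target tokens.

    token_probs: model probability assigned to each correct target token.
    token_types: category label for each token (keys of `weights`).
    """
    if len(token_probs) != len(token_types):
        raise ValueError("token_probs and token_types must have equal length")
    total_loss = 0.0
    total_weight = 0.0
    for prob, kind in zip(token_probs, token_types):
        w = weights[kind]
        total_loss += w * -math.log(prob)  # per-token NLL scaled by its weight
        total_weight += w
    if total_weight == 0.0:
        raise ValueError("sum of weights must be positive")
    return total_loss / total_weight


# Toy sequence: two confident reasoning tokens, then a less confident
# action token and a poorly predicted grounding (coordinate) token.
probs = [0.9, 0.9, 0.5, 0.25]
types = ["reasoning", "reasoning", "action", "grounding"]

uniform = weighted_token_nll(
    probs, types, {"reasoning": 1.0, "action": 1.0, "grounding": 1.0}
)
weighted = weighted_token_nll(probs, types)
```

In this toy case the reweighted loss is larger than the uniform one because the poorly predicted action and grounding tokens dominate, which is the intended effect of emphasizing those tokens during training.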
