VEM: Environment-Free Exploration for Training GUI Agent with Value Environment Model

Abstract

Training Vision-Language Models (VLMs) for Graphical User Interface (GUI) agents via Reinforcement Learning (RL) faces critical challenges: environment-based RL requires costly interactions, while environment-free methods struggle with distribution shift and reward generalization. We propose an environment-free RL framework that decouples value estimation from policy optimization by leveraging a pretrained Value Environment Model (VEM). VEM predicts state-action values directly from offline data, distilling human-like priors about GUI interaction outcomes without requiring next-state prediction or environmental feedback. This avoids compounding errors and improves resilience to UI changes by focusing on semantic reasoning (e.g., "Does this action advance the user's goal?"). The framework operates in two stages: (1) pretraining VEM to estimate long-term action utilities, and (2) guiding policy exploration with frozen VEM signals, which enables layout-agnostic GUI automation. Evaluated on the Android-in-the-Wild benchmark, VEM achieves state-of-the-art performance in both offline and online settings, significantly outperforming environment-free baselines and matching environment-based approaches without incurring interaction costs. Importantly, VEM demonstrates that semantic-aware value estimation can achieve performance comparable to that of online-trained methods.
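The two-stage structure described in the abstract can be illustrated with a deliberately tiny sketch. This is not the authors' implementation: a tabular mean stands in for the pretrained VEM, a per-state softmax stands in for the VLM policy, and the states, actions, value labels, and function names (`pretrain_value_model`, `train_policy`, `greedy_action`) are all hypothetical. It only shows the decoupling: value estimates are fit once from offline data, then frozen and used as the sole training signal for the policy, with no environment calls.

```python
import math
import random
from collections import defaultdict

# Hypothetical offline dataset of (state, action, value_label) tuples.
# In the paper the states are GUI screens; short strings stand in for them here.
OFFLINE_DATA = [
    ("home", "open_settings", 1.0),
    ("home", "open_camera", 0.0),
    ("home", "open_settings", 1.0),
    ("settings", "tap_wifi", 1.0),
    ("settings", "go_back", 0.0),
]
ACTIONS = {
    "home": ["open_settings", "open_camera"],
    "settings": ["tap_wifi", "go_back"],
}


def pretrain_value_model(data):
    """Stage 1 (toy): estimate Q(s, a) from offline labels as a per-pair mean."""
    sums, counts = defaultdict(float), defaultdict(int)
    for state, action, label in data:
        sums[(state, action)] += label
        counts[(state, action)] += 1
    return {key: sums[key] / counts[key] for key in sums}


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def train_policy(frozen_q, actions, steps=500, lr=0.5, seed=0):
    """Stage 2 (toy): policy-gradient updates scored only by the frozen value model."""
    rng = random.Random(seed)
    logits = {state: [0.0] * len(acts) for state, acts in actions.items()}
    states = sorted(actions)
    for _ in range(steps):
        state = rng.choice(states)
        probs = softmax(logits[state])
        # The policy explores by sampling its own action; no environment step follows.
        idx = rng.choices(range(len(probs)), weights=probs)[0]
        q_values = [frozen_q.get((state, a), 0.0) for a in actions[state]]
        baseline = sum(p * q for p, q in zip(probs, q_values))
        advantage = q_values[idx] - baseline
        for j, p in enumerate(probs):
            # Gradient of log softmax probability of the sampled action.
            grad = (1.0 if j == idx else 0.0) - p
            logits[state][j] += lr * advantage * grad
    return logits


def greedy_action(logits, actions, state):
    """Return the action with the highest policy logit in the given state."""
    best = max(range(len(actions[state])), key=lambda j: logits[state][j])
    return actions[state][best]


if __name__ == "__main__":
    q_table = pretrain_value_model(OFFLINE_DATA)  # fit once, then treat as frozen
    policy_logits = train_policy(q_table, ACTIONS)
    for s in sorted(ACTIONS):
        print(s, greedy_action(policy_logits, ACTIONS, s))
```

In this sketch the policy update never queries a simulator or device; swapping the lookup table for a learned value model would leave the second stage unchanged, which is the property the abstract emphasizes.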

