
VEM: Environment-Free Exploration for Training GUI Agent with Value Environment Model

February 26, 2025
Authors: Jiani Zheng, Lu Wang, Fangkai Yang, Chaoyun Zhang, Lingrui Mei, Wenjie Yin, Qingwei Lin, Dongmei Zhang, Saravan Rajmohan, Qi Zhang
cs.AI

Abstract

Training Vision-Language Models (VLMs) for Graphical User Interfaces (GUI) agents via Reinforcement Learning (RL) faces critical challenges: environment-based RL requires costly interactions, while environment-free methods struggle with distribution shift and reward generalization. We propose an environment-free RL framework that decouples value estimation from policy optimization by leveraging a pretrained Value Environment Model (VEM). VEM predicts state-action values directly from offline data, distilling human-like priors about GUI interaction outcomes without requiring next-state prediction or environmental feedback. This avoids compounding errors and enhances resilience to UI changes by focusing on semantic reasoning (e.g., Does this action advance the user's goal?). The framework operates in two stages: (1) pretraining VEM to estimate long-term action utilities and (2) guiding policy exploration with frozen VEM signals, enabling layout-agnostic GUI automation. Evaluated on Android-in-the-Wild benchmarks, VEM achieves state-of-the-art performance in both offline and online settings, outperforming environment-free baselines significantly and matching environment-based approaches without interaction costs. Importantly, VEM demonstrates that semantic-aware value estimation can achieve comparable performance with online-trained methods.
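To make the two-stage recipe concrete, here is a minimal sketch in PyTorch. It assumes a toy MLP value model and a discrete one-hot action space; the class names (`VEM`, `Policy`), dimensions, and synthetic tensors are illustrative placeholders, not the paper's actual VLM-based implementation. Stage 1 regresses the value model onto offline returns; stage 2 freezes it and uses its scores as a REINFORCE-style reward signal, with no environment interaction.

```python
import torch
import torch.nn as nn
import torch.optim as optim

class VEM(nn.Module):
    """Toy Value Environment Model: scores (state, action) pairs directly,
    with no next-state prediction or environment rollout."""
    def __init__(self, state_dim, action_dim, hidden=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim + action_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, 1),
        )

    def forward(self, state, action):
        return self.net(torch.cat([state, action], dim=-1)).squeeze(-1)

class Policy(nn.Module):
    """Toy policy head emitting action logits from a state encoding."""
    def __init__(self, state_dim, action_dim, hidden=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, action_dim),
        )

    def forward(self, state):
        return self.net(state)

state_dim, n_actions = 64, 8

# Synthetic offline batch standing in for logged GUI trajectories.
states = torch.randn(512, state_dim)
actions = nn.functional.one_hot(
    torch.randint(0, n_actions, (512,)), n_actions).float()
returns = torch.rand(512)  # long-term utility labels, e.g. Monte Carlo returns

# --- Stage 1: pretrain VEM on offline (state, action, return) data ---
vem = VEM(state_dim, n_actions)
vem_opt = optim.Adam(vem.parameters(), lr=1e-4)
for _ in range(100):
    loss = nn.functional.mse_loss(vem(states, actions), returns)
    vem_opt.zero_grad(); loss.backward(); vem_opt.step()

# --- Stage 2: freeze VEM and use its scores to guide policy updates ---
for p in vem.parameters():
    p.requires_grad_(False)

policy = Policy(state_dim, n_actions)
pi_opt = optim.Adam(policy.parameters(), lr=1e-4)
for _ in range(100):
    dist = torch.distributions.Categorical(logits=policy(states))
    sampled = dist.sample()
    onehot = nn.functional.one_hot(sampled, n_actions).float()
    score = vem(states, onehot)   # frozen value signal, no env feedback
    baseline = score.mean()       # simple baseline for variance reduction
    pg_loss = -(dist.log_prob(sampled) * (score - baseline)).mean()
    pi_opt.zero_grad(); pg_loss.backward(); pi_opt.step()
```

Note how freezing the value model in stage 2 is what decouples value estimation from policy optimization: the policy explores only over offline states, and the fixed VEM scores stand in for environment reward.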
