

UI-S1: Advancing GUI Automation via Semi-online Reinforcement Learning

September 15, 2025
作者: Zhengxi Lu, Jiabo Ye, Fei Tang, Yongliang Shen, Haiyang Xu, Ziwei Zheng, Weiming Lu, Ming Yan, Fei Huang, Jun Xiao, Yueting Zhuang
cs.AI

Abstract

Graphical User Interface (GUI) agents have demonstrated remarkable progress in automating complex user interface interactions through reinforcement learning. However, current approaches face a fundamental dilemma: offline RL enables stable training on pre-collected trajectories, but struggles with multi-step task execution due to the lack of trajectory-level reward signals; online RL captures these signals through environment interaction, but suffers from sparse rewards and prohibitive deployment costs. To address this dilemma, we present Semi-online Reinforcement Learning, a novel paradigm that simulates online RL on offline trajectories. During each rollout process, we preserve the original model output within the multi-turn dialogue, where a Patch Module adaptively recovers the divergence between rollout and expert trajectories. To capture long-term training signals, Semi-online RL introduces discounted future returns into the reward computation and optimizes the policy with weighted step-level and episode-level advantages. We further introduce Semi-Online Performance (SOP), a metric that aligns better with true online performance, serving as a practical and effective proxy for real-world evaluation. Experiments show that our Semi-online RL achieves SOTA performance among 7B models across four dynamic benchmarks, with significant gains over the base model (e.g., +12.0% on AndroidWorld, +23.8% on AITW), marking substantial progress in bridging the gap between offline training efficiency and online multi-turn reasoning. The code is available at https://github.com/X-PLUG/MobileAgent/tree/main/UI-S1.
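
The abstract only sketches how discounted future returns and the weighted step-level and episode-level advantages fit together. The minimal Python sketch below illustrates one plausible reading of that description; the function names, the discount factor `gamma`, the blend weight `alpha`, and the mean-return / fixed-value baselines are all illustrative assumptions, not the paper's actual formulation.

```python
import numpy as np

def discounted_returns(step_rewards, gamma=0.9):
    """Discounted future return G_t = sum_k gamma^k * r_{t+k} for each step (assumed form)."""
    returns = np.zeros(len(step_rewards))
    running = 0.0
    for t in reversed(range(len(step_rewards))):
        running = step_rewards[t] + gamma * running
        returns[t] = running
    return returns

def combined_advantages(step_rewards, episode_reward, gamma=0.9, alpha=0.5):
    """Blend step-level and episode-level advantages with a weight alpha (hypothetical)."""
    returns = discounted_returns(step_rewards, gamma)
    step_adv = returns - returns.mean()      # step-level advantage vs. mean-return baseline (assumption)
    episode_adv = episode_reward - 0.5       # episode-level advantage vs. a fixed baseline (placeholder)
    return alpha * step_adv + (1 - alpha) * episode_adv

# Example: a 4-step trajectory where the task succeeds (episode_reward = 1.0).
advantages = combined_advantages(step_rewards=[0.0, 1.0, 0.0, 1.0], episode_reward=1.0)
print(advantages)
```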