UI-S1: Advancing GUI Automation via Semi-online Reinforcement Learning
September 15, 2025
作者: Zhengxi Lu, Jiabo Ye, Fei Tang, Yongliang Shen, Haiyang Xu, Ziwei Zheng, Weiming Lu, Ming Yan, Fei Huang, Jun Xiao, Yueting Zhuang
cs.AI
Abstract
Graphical User Interface (GUI) agents have demonstrated remarkable progress
in automating complex user interface interactions through reinforcement
learning. However, current approaches face a fundamental dilemma: offline RL
enables stable training on pre-collected trajectories, but struggles with
multi-step task execution due to the lack of trajectory-level reward signals; online
RL captures these signals through environment interaction, but suffers from
sparse rewards and prohibitive deployment costs. To address this dilemma, we present
Semi-online Reinforcement Learning, a novel paradigm that simulates online RL
on offline trajectories. During each rollout process, we preserve the original
model output within the multi-turn dialogue, where a Patch Module adaptively
recovers the divergence between rollout and expert trajectories. To capture
long-term training signals, Semi-online RL introduces discounted future returns
into the reward computation and optimizes the policy with weighted step-level
and episode-level advantages. We further introduce Semi-Online Performance
(SOP), a metric that aligns better with true online performance, serving as a
practical and effective proxy for real-world evaluation. Experiments show that
our Semi-online RL achieves state-of-the-art performance among 7B models across four
dynamic benchmarks, with significant gains over the base model (e.g., +12.0% on
AndroidWorld, +23.8% on AITW), marking notable progress in bridging
the gap between offline training efficiency and online multi-turn reasoning.
The code is available at https://github.com/X-PLUG/MobileAgent/tree/main/UI-S1.
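The abstract does not give the exact formulas, but the two ingredients it names for policy optimization, discounted future returns folded into the per-step reward and a weighted combination of step-level and episode-level advantages, can be illustrated roughly as below. This is a minimal NumPy sketch under assumed GRPO-style group normalization; the function names, the blending weight `alpha`, and the normalization scheme are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

def discounted_returns(step_rewards, gamma=0.9):
    """Fold discounted future rewards back into each step's reward signal."""
    returns = np.zeros(len(step_rewards), dtype=float)
    running = 0.0
    for t in reversed(range(len(step_rewards))):
        running = step_rewards[t] + gamma * running
        returns[t] = running
    return returns

def combined_advantage(step_returns_group, episode_rewards_group, alpha=0.5):
    """
    Blend group-normalized step-level and episode-level advantages with
    weight alpha (names and normalization are illustrative assumptions).
    """
    step = np.asarray(step_returns_group, dtype=float)        # (num_rollouts, num_steps)
    episode = np.asarray(episode_rewards_group, dtype=float)  # (num_rollouts,)

    step_adv = (step - step.mean(axis=0)) / (step.std(axis=0) + 1e-8)
    episode_adv = (episode - episode.mean()) / (episode.std() + 1e-8)

    # Broadcast each rollout's episode-level advantage over all of its steps.
    return alpha * step_adv + (1 - alpha) * episode_adv[:, None]
```

In this reading, the discount factor controls how strongly later progress in the expert trajectory is credited to earlier steps, while `alpha` trades off per-step credit assignment against whole-episode outcome; both would be tuning choices rather than values stated in the abstract.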