UI-AGILE: Advancing GUI Agents with Effective Reinforcement Learning and Precise Inference-Time Grounding
July 29, 2025
Authors: Shuquan Lian, Yuhang Wu, Jia Ma, Zihan Song, Bingqi Chen, Xiawu Zheng, Hui Li
cs.AI
Abstract
The emergence of Multimodal Large Language Models (MLLMs) has driven
significant advances in Graphical User Interface (GUI) agent capabilities.
Nevertheless, existing GUI agent training and inference techniques still suffer
from a reasoning-design dilemma, ineffective rewards, and visual noise. To
address these issues, we introduce UI-AGILE, a comprehensive framework
enhancing GUI agents at both the training and inference stages. For training,
we propose a suite of improvements to the reinforcement learning (RL) training process:
1) a Continuous Reward function to incentivize high-precision grounding; 2) a
"Simple Thinking" reward to balance planning with speed and grounding accuracy;
and 3) a Cropping-based Resampling strategy to mitigate the sparse reward
problem and improve learning on complex tasks. For inference, we present
Decomposed Grounding with Selection, a novel method that dramatically improves
grounding accuracy on high-resolution displays by breaking the image into
smaller, manageable parts. Experiments show that UI-AGILE achieves
state-of-the-art performance on two benchmarks, ScreenSpot-Pro and
ScreenSpot-v2. For instance, applying both our proposed training and inference
enhancements yields a 23% improvement in grounding accuracy over the best
baseline on ScreenSpot-Pro.
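To make the abstract's two grounding ideas concrete, the sketch below shows one plausible reading of the Continuous Reward function and of Decomposed Grounding with Selection. It is a minimal illustration, not the authors' implementation: the linear reward decay, the 2x2 crop grid, the confidence-based selection, and all names (`continuous_grounding_reward`, `decomposed_grounding`, `ground_in_region`) are assumptions made for this example.

```python
import math
from typing import Callable, List, Tuple

Point = Tuple[float, float]
Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2)


def continuous_grounding_reward(pred: Point, gt_box: Box) -> float:
    """Continuous grounding reward (assumed shape): rather than a binary
    hit/miss signal, a hit is graded by how close the predicted point lies
    to the center of the ground-truth element, so more precise clicks earn
    strictly higher reward."""
    x1, y1, x2, y2 = gt_box
    if not (x1 <= pred[0] <= x2 and y1 <= pred[1] <= y2):
        return 0.0  # miss: point is outside the target element
    cx, cy = (x1 + x2) / 2.0, (y1 + y2) / 2.0
    half_diag = math.hypot(x2 - x1, y2 - y1) / 2.0
    dist = math.hypot(pred[0] - cx, pred[1] - cy)
    # Decay linearly from 1.0 at the element center to 0.5 at its corner.
    return 1.0 - 0.5 * dist / (half_diag + 1e-9)


def decomposed_grounding(
    image_size: Tuple[int, int],
    ground_in_region: Callable[[Box], Tuple[Point, float]],
    n_cols: int = 2,
    n_rows: int = 2,
) -> Point:
    """Decomposed grounding with selection (sketch): split a high-resolution
    screenshot into a grid of crops, ground the target in each crop, and
    keep the candidate the model scores highest."""
    W, H = image_size
    tile_w, tile_h = W / n_cols, H / n_rows
    candidates: List[Tuple[Point, float]] = []
    for row in range(n_rows):
        for col in range(n_cols):
            x0, y0 = col * tile_w, row * tile_h
            region = (x0, y0, x0 + tile_w, y0 + tile_h)
            # The model sees only this crop, so the target occupies a larger
            # fraction of its visual field than in the full screenshot.
            (px, py), score = ground_in_region(region)
            # Map the crop-local prediction back to full-image coordinates.
            candidates.append(((x0 + px, y0 + py), score))
    return max(candidates, key=lambda cand: cand[1])[0]
```

The common design point is a denser signal: the continuous reward grades a hit by its distance from the element's center instead of scoring all hits equally, and decomposition lets the selection step compare several crop-level candidates rather than trusting a single prediction over a noisy high-resolution screenshot.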