

UI-AGILE: Advancing GUI Agents with Effective Reinforcement Learning and Precise Inference-Time Grounding

July 29, 2025
Authors: Shuquan Lian, Yuhang Wu, Jia Ma, Zihan Song, Bingqi Chen, Xiawu Zheng, Hui Li
cs.AI

Abstract

The emergence of Multimodal Large Language Models (MLLMs) has driven significant advances in Graphical User Interface (GUI) agent capabilities. Nevertheless, existing GUI agent training and inference techniques still suffer from a reasoning-design dilemma, ineffective rewards, and visual noise. To address these issues, we introduce UI-AGILE, a comprehensive framework that enhances GUI agents at both the training and inference stages. For training, we propose a suite of improvements to the reinforcement learning (RL) training process: 1) a Continuous Reward function to incentivize high-precision grounding; 2) a "Simple Thinking" reward to balance planning with speed and grounding accuracy; and 3) a Cropping-based Resampling strategy to mitigate the sparse-reward problem and improve learning on complex tasks. For inference, we present Decomposed Grounding with Selection, a novel method that dramatically improves grounding accuracy on high-resolution displays by breaking the image into smaller, manageable parts. Experiments show that UI-AGILE achieves state-of-the-art performance on two benchmarks, ScreenSpot-Pro and ScreenSpot-v2. For instance, combining our proposed training and inference enhancements yields a 23% improvement in grounding accuracy over the best baseline on ScreenSpot-Pro.
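The abstract stays at a high level, so the following minimal sketch is offered only to make two of its central ideas concrete: a dense, distance-based grounding reward and a tile-then-select grounding loop for high-resolution screens. The exponential reward shape, the `Crop` container, and the `ground()` callable are illustrative assumptions, not the paper's implementation.

```python
# Illustrative sketch only; reward shape, crop layout, and the ground()
# call are hypothetical stand-ins, not code from the UI-AGILE paper.
import math
from dataclasses import dataclass
from typing import Any, Callable, List, Optional, Tuple

Box = Tuple[float, float, float, float]   # (x0, y0, x1, y1) in pixels
Point = Tuple[float, float]               # predicted click location


def continuous_grounding_reward(pred: Optional[Point], target: Box) -> float:
    """Dense alternative to a 0/1 hit-or-miss grounding reward.

    Returns 1.0 at the target's center and decays smoothly with distance,
    so near-misses still provide a learning signal instead of a flat zero.
    """
    if pred is None:                      # unparsable model output
        return 0.0
    cx, cy = (target[0] + target[2]) / 2, (target[1] + target[3]) / 2
    # Normalize the error by the target's diagonal (an assumed scale choice).
    diag = math.hypot(target[2] - target[0], target[3] - target[1])
    dist = math.hypot(pred[0] - cx, pred[1] - cy)
    return math.exp(-dist / max(diag, 1e-6))


@dataclass
class Crop:
    region: Box        # location of this crop within the full screenshot
    image: Any         # cropped image handed to the grounding model


def decomposed_grounding(
    crops: List[Crop],
    instruction: str,
    ground: Callable[[Any, str], Tuple[Point, float]],
) -> Point:
    """Ground each crop separately, then select the best candidate.

    `ground` is a placeholder for the agent's grounding call; it is assumed
    to return a click point in crop coordinates plus a confidence score.
    """
    best_point, best_score = (0.0, 0.0), float("-inf")
    for crop in crops:
        (px, py), score = ground(crop.image, instruction)
        if score > best_score:
            # Map the local prediction back to full-image coordinates.
            best_point = (crop.region[0] + px, crop.region[1] + py)
            best_score = score
    return best_point
```

The sketch assumes the crops already exist and that the grounding model exposes some confidence signal for selection; how UI-AGILE actually partitions the screenshot and scores candidates is described in the paper itself.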