UI-AGILE: Advancing GUI Agents with Effective Reinforcement Learning and Precise Inference-Time Grounding

Abstract

The emergence of Multimodal Large Language Models (MLLMs) has driven significant advances in the capabilities of Graphical User Interface (GUI) agents. Nevertheless, existing training and inference techniques for GUI agents still face three problems: a dilemma in reasoning design, ineffective rewards, and visual noise. To address these issues, we introduce UI-AGILE, a comprehensive framework that enhances GUI agents at both the training and inference stages. For training, we propose a suite of improvements to the Reinforcement Fine-Tuning (RFT) process: 1) a Continuous Reward function to incentivize high-precision grounding; 2) a "Simple Thinking" reward to balance planning with speed and grounding accuracy; and 3) a Cropping-based Resampling strategy to mitigate the sparse reward problem and improve learning on complex tasks. For inference, we present Decomposed Grounding with Selection, a novel method that dramatically improves grounding accuracy on high-resolution displays by breaking the image into smaller, manageable parts. Experiments show that UI-AGILE achieves state-of-the-art performance on two benchmarks, ScreenSpot-Pro and ScreenSpot-v2. For instance, combining the proposed training and inference enhancements yields a 23% improvement in grounding accuracy over the best baseline on ScreenSpot-Pro.
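The abstract names these components without giving formulas, so the following Python sketch is only an illustration of the two ideas that lend themselves to code: a reward that grows as the predicted point approaches the center of the target box, and a decomposition of a large screenshot into overlapping crops whose local predictions are mapped back to full-image coordinates. Every function name, the linear reward shape, the tile size, the overlap, and the use of a per-candidate score for selection are assumptions made for illustration, not the method as specified by the authors.

```python
def continuous_reward(point, bbox):
    """Hypothetical continuous grounding reward (assumed shape, not the paper's).

    Returns 0.0 outside the target box, 1.0 at its center, and decays
    linearly to 0.5 at the box edge.
    """
    x, y = point
    x1, y1, x2, y2 = bbox
    if not (x1 <= x <= x2 and y1 <= y <= y2):
        return 0.0
    cx, cy = (x1 + x2) / 2, (y1 + y2) / 2
    half_w, half_h = (x2 - x1) / 2, (y2 - y1) / 2
    # Normalized offsets from the center; each lies in [0, 1] inside the box.
    dx = abs(x - cx) / half_w if half_w else 0.0
    dy = abs(y - cy) / half_h if half_h else 0.0
    return 1.0 - 0.5 * max(dx, dy)


def split_into_tiles(width, height, tile, overlap):
    """Return (left, top, right, bottom) crops that cover the whole image."""
    step = tile - overlap
    if step <= 0:
        raise ValueError("overlap must be smaller than tile")

    def starts(size):
        # A single crop suffices when the image fits inside one tile.
        if size <= tile:
            return [0]
        positions = list(range(0, size - tile, step))
        positions.append(size - tile)  # last crop is flush with the far edge
        return positions

    return [
        (left, top, min(left + tile, width), min(top + tile, height))
        for top in starts(height)
        for left in starts(width)
    ]


def to_global(tile_box, local_point):
    """Map a point predicted inside a crop back to full-image coordinates."""
    left, top, _, _ = tile_box
    return (left + local_point[0], top + local_point[1])


def select_best(candidates):
    """Pick the highest-scoring candidate and return its global point.

    candidates: list of (tile_box, local_point, score). How the score is
    produced is left open here; it is an assumption of this sketch.
    """
    tile_box, local_point, _ = max(candidates, key=lambda c: c[2])
    return to_global(tile_box, local_point)
```

In use, a grounding model would be run once per crop returned by `split_into_tiles`, and each prediction would be passed to `select_best` along with whatever score the selection stage assigns. The crop tuples follow the common (left, top, right, bottom) convention so they can be handed to an image library's crop routine directly.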