

InfiGUI-G1: Advancing GUI Grounding with Adaptive Exploration Policy Optimization

August 7, 2025
Authors: Yuhang Liu, Zeyu Liu, Shuanghe Zhu, Pengxiang Li, Congkai Xie, Jiasheng Wang, Xueyu Hu, Xiaotian Han, Jianbo Yuan, Xinyao Wang, Shengyu Zhang, Hongxia Yang, Fei Wu
cs.AI

Abstract

The emergence of Multimodal Large Language Models (MLLMs) has propelled the development of autonomous agents that operate on Graphical User Interfaces (GUIs) using pure visual input. A fundamental challenge is robustly grounding natural language instructions. This requires precise spatial alignment, which accurately locates the coordinates of each element, and, more critically, correct semantic alignment, which matches the instruction to the functionally appropriate UI element. Although Reinforcement Learning with Verifiable Rewards (RLVR) has proven effective at improving spatial alignment for these MLLMs, we find that inefficient exploration bottlenecks semantic alignment, which prevents models from learning difficult semantic associations. To address this exploration problem, we present Adaptive Exploration Policy Optimization (AEPO), a new policy optimization framework. AEPO employs a multi-answer generation strategy to enforce broader exploration, which is then guided by a theoretically grounded Adaptive Exploration Reward (AER) function derived from first principles of efficiency, η = U/C. Our AEPO-trained models, InfiGUI-G1-3B and InfiGUI-G1-7B, establish new state-of-the-art results across multiple challenging GUI grounding benchmarks, achieving significant relative improvements of up to 9.0% over the naive RLVR baseline on benchmarks designed to test generalization and semantic understanding. Resources are available at https://github.com/InfiXAI/InfiGUI-G1.
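The abstract states only that the AER function is derived from an efficiency principle η = U/C under a multi-answer generation strategy. As a minimal sketch of that idea, one plausible instantiation (an assumption, not the paper's actual definition) treats utility U as whether any of the sampled grounding candidates hits the target element, and cost C as the number of candidates generated:

```python
def adaptive_exploration_reward(candidates, target):
    """Hypothetical sketch of an efficiency-style reward, eta = U / C.

    Assumptions not specified in the abstract: utility U is 1.0 if any
    candidate grounding matches the target UI element, else 0.0; cost C
    is the number of candidates generated. The actual AER function in
    the paper may define U and C differently.
    """
    if not candidates:
        return 0.0
    utility = 1.0 if any(c == target for c in candidates) else 0.0
    cost = len(candidates)  # more answers -> higher exploration cost
    return utility / cost


# Under this sketch, a correct hit among two candidates scores 0.5,
# rewarding exploration that succeeds with as few samples as possible.
reward = adaptive_exploration_reward(["button_a", "button_b"], "button_b")
```

A reward of this shape pressures the policy to explore broadly when uncertain (since any hit yields positive reward) while still penalizing indiscriminate over-generation through the cost denominator.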