InfiGUI-G1: Advancing GUI Grounding with Adaptive Exploration Policy Optimization
August 7, 2025
Authors: Yuhang Liu, Zeyu Liu, Shuanghe Zhu, Pengxiang Li, Congkai Xie, Jiasheng Wang, Xueyu Hu, Xiaotian Han, Jianbo Yuan, Xinyao Wang, Shengyu Zhang, Hongxia Yang, Fei Wu
cs.AI
Abstract
The emergence of Multimodal Large Language Models (MLLMs) has propelled the
development of autonomous agents that operate on Graphical User Interfaces
(GUIs) using pure visual input. A fundamental challenge is robustly grounding
natural language instructions. This requires a precise spatial alignment, which
accurately locates the coordinates of each element, and, more critically, a
correct semantic alignment, which matches the instructions to the functionally
appropriate UI element. Although Reinforcement Learning with Verifiable Rewards
(RLVR) has proven to be effective at improving spatial alignment for these
MLLMs, we find that inefficient exploration bottlenecks semantic alignment,
which prevents models from learning difficult semantic associations. To address
this exploration problem, we present Adaptive Exploration Policy Optimization
(AEPO), a new policy optimization framework. AEPO employs a multi-answer
generation strategy to enforce broader exploration, which is then guided by a
theoretically grounded Adaptive Exploration Reward (AER) function derived from
the first principle of efficiency, η = U/C. Our AEPO-trained models, InfiGUI-G1-3B
and InfiGUI-G1-7B, establish new state-of-the-art results across multiple
challenging GUI grounding benchmarks, achieving significant relative
improvements of up to 9.0% against the naive RLVR baseline on benchmarks
designed to test generalization and semantic understanding. Resources are
available at https://github.com/InfiXAI/InfiGUI-G1.
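The efficiency principle η = U/C behind the AER function can be illustrated with a toy reward. The sketch below is an assumption-laden simplification, not the paper's actual AER: it supposes utility U is a binary indicator of whether any sampled answer grounds the instruction to the correct UI element, and cost C is simply the number of answers the policy generated. The function name and setup are hypothetical.

```python
def adaptive_exploration_reward(hits: list[bool]) -> float:
    """Toy efficiency-style reward, eta = U / C (illustrative only).

    hits: one flag per sampled answer, True if that candidate points to
          the functionally correct UI element (hypothetical setup).
    U (utility): 1.0 if any candidate is correct, else 0.0.
    C (cost): the number of candidates generated.
    """
    if not hits:
        return 0.0  # no answers generated, no reward
    utility = 1.0 if any(hits) else 0.0
    return utility / len(hits)

# A single correct answer earns the full reward; padding the answer set
# with wrong guesses dilutes it, so the policy is rewarded for exploring
# broadly only when the extra candidates actually improve the hit rate.
print(adaptive_exploration_reward([True]))                # 1.0
print(adaptive_exploration_reward([False, True, False]))  # ~0.333
print(adaptive_exploration_reward([False, False]))        # 0.0
```

Under this toy scheme, multi-answer generation is worthwhile for hard instructions (where one guess would likely miss) but penalized for easy ones, which matches the adaptive-exploration intuition described in the abstract.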