BTL-UI: Blink-Think-Link Reasoning Model for GUI Agent
September 19, 2025
Authors: Shaojie Zhang, Ruoceng Zhang, Pei Fu, Shaokang Wang, Jiahui Yang, Xin Du, Shiqi Cui, Bin Qin, Ying Huang, Zhenbo Luo, Jian Luan
cs.AI
Abstract
In the field of AI-driven human-GUI interaction automation, while rapid
advances in multimodal large language models and reinforcement fine-tuning
techniques have yielded remarkable progress, a fundamental challenge persists:
their interaction logic significantly deviates from natural human-GUI
communication patterns. To fill this gap, we propose "Blink-Think-Link" (BTL),
a brain-inspired framework for human-GUI interaction that mimics the human
cognitive process between users and graphical interfaces. The system decomposes
interactions into three biologically plausible phases: (1) Blink - rapid
detection and attention to relevant screen areas, analogous to saccadic eye
movements; (2) Think - higher-level reasoning and decision-making, mirroring
cognitive planning; and (3) Link - generation of executable commands for
precise motor control, emulating human action selection mechanisms.
Additionally, we introduce two key technical innovations for the BTL framework:
(1) Blink Data Generation - an automated annotation pipeline specifically
optimized for blink data, and (2) BTL Reward - the first rule-based reward
mechanism that enables reinforcement learning driven by both process and
outcome. Building upon this framework, we develop a GUI agent model named
BTL-UI, which demonstrates consistent state-of-the-art performance across both
static GUI understanding and dynamic interaction tasks in comprehensive
benchmarks. These results provide conclusive empirical validation of the
framework's efficacy in developing advanced GUI agents.
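
The abstract describes the Blink-Think-Link decomposition but does not give the model's output schema. Below is a minimal sketch of how such a structured response might be parsed into its three phases, assuming a hypothetical tagged format: the `<blink>`, `<think>`, and `<link>` tags, the JSON region list, and the JSON action object are illustrative choices, not taken from the paper.

```python
import json
import re

# Hypothetical tag format for a BTL-structured response; the paper's exact
# output schema is not given in the abstract.
BTL_PATTERN = re.compile(
    r"<blink>(?P<blink>.*?)</blink>\s*"
    r"<think>(?P<think>.*?)</think>\s*"
    r"<link>(?P<link>.*?)</link>",
    re.DOTALL,
)

def parse_btl_response(text: str) -> dict:
    """Split a model response into its Blink / Think / Link parts."""
    match = BTL_PATTERN.search(text)
    if match is None:
        raise ValueError("response does not follow the assumed BTL tag format")
    return {
        "blink": json.loads(match["blink"]),  # attended screen regions, e.g. [[x1, y1, x2, y2], ...]
        "think": match["think"].strip(),      # free-form reasoning / plan
        "link": json.loads(match["link"]),    # executable command, e.g. {"action": "click", ...}
    }

example = (
    "<blink>[[120, 340, 260, 380]]</blink>"
    "<think>The Submit button lies in the attended region; clicking it completes the task.</think>"
    '<link>{"action": "click", "coordinate": [190, 360]}</link>'
)
print(parse_btl_response(example))
```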
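The BTL Reward is described only as a rule-based mechanism driven by both process and outcome. The sketch below illustrates one plausible shape for such a reward, assuming the process term checks whether a predicted Blink region localizes the ground-truth element via an IoU threshold and the outcome term checks the emitted Link action; the weights, threshold, and matching rules are assumptions, not the paper's actual definition.

```python
def iou(box_a, box_b):
    """Intersection-over-union of two [x1, y1, x2, y2] boxes."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    ix1, iy1 = max(ax1, bx1), max(ay1, by1)
    ix2, iy2 = min(ax2, bx2), min(ay2, by2)
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - inter
    return inter / union if union > 0 else 0.0

def btl_reward(pred, gt, w_process=0.3, w_outcome=0.7, iou_threshold=0.5):
    """Rule-based reward mixing a process term (Blink) with an outcome term (Link).

    The weights, threshold, and matching rules here are illustrative assumptions,
    not the reward actually used in the paper.
    """
    # Process term: did any predicted Blink region localize the target element?
    process = 1.0 if any(
        iou(region, gt["target_region"]) >= iou_threshold for region in pred["blink"]
    ) else 0.0

    # Outcome term: the emitted action type must match; for clicks, the
    # coordinate must also fall inside the target element.
    action_ok = pred["link"]["action"] == gt["action_type"]
    if action_ok and gt["action_type"] == "click":
        x, y = pred["link"]["coordinate"]
        x1, y1, x2, y2 = gt["target_region"]
        action_ok = x1 <= x <= x2 and y1 <= y <= y2
    outcome = 1.0 if action_ok else 0.0

    return w_process * process + w_outcome * outcome

# Example usage with the parsed prediction from the previous sketch.
pred = {
    "blink": [[120, 340, 260, 380]],
    "link": {"action": "click", "coordinate": [190, 360]},
}
gt = {"target_region": [118, 338, 262, 382], "action_type": "click"}
print(btl_reward(pred, gt))  # -> 1.0 for these toy inputs
```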