BTL-UI: Blink-Think-Link Reasoning Model for GUI Agent
September 19, 2025
作者: Shaojie Zhang, Ruoceng Zhang, Pei Fu, Shaokang Wang, Jiahui Yang, Xin Du, Shiqi Cui, Bin Qin, Ying Huang, Zhenbo Luo, Jian Luan
cs.AI
Abstract
In AI-driven automation of human-GUI interaction, rapid advances in multimodal large language models and reinforcement fine-tuning have yielded remarkable progress, yet a fundamental challenge persists: the agents' interaction logic deviates significantly from natural human-GUI communication patterns. To bridge this gap, we propose "Blink-Think-Link" (BTL), a brain-inspired framework for human-GUI interaction that mimics the human cognitive process between users and graphical interfaces. The system decomposes interaction into three biologically plausible phases: (1) Blink - rapid detection of and attention to relevant screen regions, analogous to saccadic eye movements; (2) Think - higher-level reasoning and decision-making, mirroring cognitive planning; and (3) Link - generation of executable commands for precise motor control, emulating human action-selection mechanisms. We further introduce two key technical innovations for the BTL framework: (1) Blink Data Generation - an automated annotation pipeline specifically optimized for blink data, and (2) BTL Reward - the first rule-based reward mechanism that enables reinforcement learning driven by both process and outcome. Building on this framework, we develop a GUI agent model named BTL-UI, which achieves consistent state-of-the-art performance on comprehensive benchmarks covering both static GUI understanding and dynamic interaction tasks. These results provide conclusive empirical validation of the framework's efficacy in developing advanced GUI agents.
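The abstract names the BTL Reward but does not spell out its concrete form, so the Python sketch below is only an illustration of how a rule-based reward driven by both process and outcome could be assembled. The <blink>/<think>/<link> tag format, the IoU threshold, the weights, and the helper names (btl_reward, iou) are assumptions for illustration, not the paper's actual implementation.

```python
import re


def btl_reward(response: str, gt_box, gt_action: str, iou_threshold: float = 0.5) -> float:
    """Hypothetical BTL-style reward: process terms (well-formed Blink/Think/Link
    output and blink-region overlap with an annotated box) plus an outcome term
    (the emitted command matches the ground-truth action)."""
    # Process reward, part 1: the response follows the three-phase format.
    blink = re.search(r"<blink>(.*?)</blink>", response, re.S)
    think = re.search(r"<think>(.*?)</think>", response, re.S)
    link = re.search(r"<link>(.*?)</link>", response, re.S)
    format_reward = 1.0 if (blink and think and link) else 0.0

    # Process reward, part 2: the predicted blink region overlaps the annotated region.
    blink_reward = 0.0
    if blink:
        nums = re.findall(r"-?\d+\.?\d*", blink.group(1))
        if len(nums) >= 4:
            box = [float(v) for v in nums[:4]]
            blink_reward = 1.0 if iou(box, gt_box) >= iou_threshold else 0.0

    # Outcome reward: the Link phase contains the ground-truth executable command.
    outcome_reward = 1.0 if link and gt_action.strip() in link.group(1) else 0.0

    # Illustrative weighting only; the abstract does not give the paper's coefficients.
    return 0.2 * format_reward + 0.3 * blink_reward + 0.5 * outcome_reward


def iou(a, b) -> float:
    """Intersection-over-union of two [x1, y1, x2, y2] boxes."""
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if inter > 0 else 0.0


if __name__ == "__main__":
    resp = ("<blink>[120, 40, 360, 90]</blink>"
            "<think>The search field is in the top bar, so I should tap it.</think>"
            "<link>click(240, 65)</link>")
    print(btl_reward(resp, gt_box=[118, 38, 362, 92], gt_action="click(240, 65)"))  # 1.0
```

The key design point this sketch tries to mirror is that the reward is not purely outcome-based: even when the final action is wrong, a well-formed response whose Blink region lands on the right screen area still receives partial credit, giving the reinforcement-learning signal a process component.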