BTL-UI: Blink-Think-Link Reasoning Model for GUI Agent
September 19, 2025
Authors: Shaojie Zhang, Ruoceng Zhang, Pei Fu, Shaokang Wang, Jiahui Yang, Xin Du, Shiqi Cui, Bin Qin, Ying Huang, Zhenbo Luo, Jian Luan
cs.AI
Abstract
In the field of AI-driven human-GUI interaction automation, while rapid
advances in multimodal large language models and reinforcement fine-tuning
techniques have yielded remarkable progress, a fundamental challenge persists:
their interaction logic significantly deviates from natural human-GUI
communication patterns. To fill this gap, we propose "Blink-Think-Link" (BTL),
a brain-inspired framework for human-GUI interaction that mimics the human
cognitive process between users and graphical interfaces. The system decomposes
interactions into three biologically plausible phases: (1) Blink - rapid
detection and attention to relevant screen areas, analogous to saccadic eye
movements; (2) Think - higher-level reasoning and decision-making, mirroring
cognitive planning; and (3) Link - generation of executable commands for
precise motor control, emulating human action selection mechanisms.
Additionally, we introduce two key technical innovations for the BTL framework:
(1) Blink Data Generation - an automated annotation pipeline specifically
optimized for blink data, and (2) BTL Reward - the first rule-based reward
mechanism that enables reinforcement learning driven by both process and
outcome. Building upon this framework, we develop a GUI agent model named
BTL-UI, which demonstrates consistent state-of-the-art performance across both
static GUI understanding and dynamic interaction tasks in comprehensive
benchmarks. These results provide conclusive empirical validation of the
framework's efficacy in developing advanced GUI agents.
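
The abstract describes the Blink-Think-Link decomposition but does not give the model's output schema. Below is a minimal sketch of how such a structured response might be parsed into its three phases, assuming a hypothetical tagged format: the `<blink>`, `<think>`, and `<link>` tags, the JSON region list, and the JSON action object are illustrative choices, not taken from the paper.

```python
import json
import re

# Hypothetical tag format for a BTL-structured response; the paper's exact
# output schema is not given in the abstract.
BTL_PATTERN = re.compile(
    r"<blink>(?P<blink>.*?)</blink>\s*"
    r"<think>(?P<think>.*?)</think>\s*"
    r"<link>(?P<link>.*?)</link>",
    re.DOTALL,
)

def parse_btl_response(text: str) -> dict:
    """Split a model response into its Blink / Think / Link parts."""
    match = BTL_PATTERN.search(text)
    if match is None:
        raise ValueError("response does not follow the assumed BTL tag format")
    return {
        "blink": json.loads(match["blink"]),  # attended screen regions, e.g. [[x1, y1, x2, y2], ...]
        "think": match["think"].strip(),      # free-form reasoning / plan
        "link": json.loads(match["link"]),    # executable command, e.g. {"action": "click", ...}
    }

example = (
    "<blink>[[120, 340, 260, 380]]</blink>"
    "<think>The Submit button lies in the attended region; clicking it completes the task.</think>"
    '<link>{"action": "click", "coordinate": [190, 360]}</link>'
)
print(parse_btl_response(example))
```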
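The BTL Reward is described only as a rule-based mechanism driven by both process and outcome. The sketch below illustrates one plausible shape for such a reward, assuming the process term checks whether a predicted Blink region localizes the ground-truth element via an IoU threshold and the outcome term checks the emitted Link action; the weights, threshold, and matching rules are assumptions, not the paper's actual definition.

```python
def iou(box_a, box_b):
    """Intersection-over-union of two [x1, y1, x2, y2] boxes."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    ix1, iy1 = max(ax1, bx1), max(ay1, by1)
    ix2, iy2 = min(ax2, bx2), min(ay2, by2)
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - inter
    return inter / union if union > 0 else 0.0

def btl_reward(pred, gt, w_process=0.3, w_outcome=0.7, iou_threshold=0.5):
    """Rule-based reward mixing a process term (Blink) with an outcome term (Link).

    The weights, threshold, and matching rules here are illustrative assumptions,
    not the reward actually used in the paper.
    """
    # Process term: did any predicted Blink region localize the target element?
    process = 1.0 if any(
        iou(region, gt["target_region"]) >= iou_threshold for region in pred["blink"]
    ) else 0.0

    # Outcome term: the emitted action type must match; for clicks, the
    # coordinate must also fall inside the target element.
    action_ok = pred["link"]["action"] == gt["action_type"]
    if action_ok and gt["action_type"] == "click":
        x, y = pred["link"]["coordinate"]
        x1, y1, x2, y2 = gt["target_region"]
        action_ok = x1 <= x <= x2 and y1 <= y <= y2
    outcome = 1.0 if action_ok else 0.0

    return w_process * process + w_outcome * outcome

# Example usage with the parsed prediction from the previous sketch.
pred = {
    "blink": [[120, 340, 260, 380]],
    "link": {"action": "click", "coordinate": [190, 360]},
}
gt = {"target_region": [118, 338, 262, 382], "action_type": "click"}
print(btl_reward(pred, gt))  # -> 1.0 for these toy inputs
```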