BTL-UI: Blink-Think-Link Reasoning Model for GUI Agent

Abstract

In AI-driven automation of human-GUI interaction, rapid advances in multimodal large language models and reinforcement fine-tuning have yielded remarkable progress, yet a fundamental challenge persists: the interaction logic of these models deviates significantly from natural human-GUI communication patterns. To fill this gap, we propose "Blink-Think-Link" (BTL), a brain-inspired framework for human-GUI interaction that mimics the cognitive process by which users engage with graphical interfaces. The framework decomposes each interaction into three biologically plausible phases: (1) Blink - rapid detection of and attention to relevant screen areas, analogous to saccadic eye movements; (2) Think - higher-level reasoning and decision-making, mirroring cognitive planning; and (3) Link - generation of executable commands for precise motor control, emulating human action-selection mechanisms. We also introduce two key technical innovations for the BTL framework: (1) Blink Data Generation - an automated annotation pipeline specifically optimized for Blink data; and (2) BTL Reward - the first rule-based reward mechanism that enables reinforcement learning driven by both process and outcome. Building on this framework, we develop a GUI agent model named BTL-UI, which demonstrates consistent state-of-the-art performance on both static GUI understanding and dynamic interaction tasks across comprehensive benchmarks. These results provide conclusive empirical validation of the framework's efficacy in developing advanced GUI agents.
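The abstract does not specify the output format or the reward formula, so the following is only an illustrative sketch of how a three-phase Blink-Think-Link step and a rule-based reward mixing a process term with an outcome term might be represented. Every name here (`BTLStep`, `iou`, `toy_btl_reward`, the `w_process` weight, the use of box overlap as the process signal) is a hypothetical choice made for illustration, not the authors' implementation.

```python
from dataclasses import dataclass
from typing import Tuple

# Hypothetical box format: (x1, y1, x2, y2) in screen pixels.
Box = Tuple[float, float, float, float]


@dataclass
class BTLStep:
    """One interaction step split into the three phases named in the abstract."""
    blink: Box                       # Blink: screen region the model attends to
    think: str                       # Think: free-form reasoning text
    link: Tuple[str, float, float]   # Link: (action_type, x, y) command


def iou(a: Box, b: Box) -> float:
    """Intersection-over-union of two axis-aligned boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def toy_btl_reward(step: BTLStep, ref_region: Box, ref_action: str,
                   target_box: Box, w_process: float = 0.5) -> float:
    """Assumed rule-based reward: weighted sum of a process and an outcome term.

    Process term: overlap between the Blink region and an annotated region.
    Outcome term: 1.0 if the Link action type matches and its point lands
    inside the target element, else 0.0.
    """
    process = iou(step.blink, ref_region)
    action, x, y = step.link
    hit = target_box[0] <= x <= target_box[2] and target_box[1] <= y <= target_box[3]
    outcome = 1.0 if (action == ref_action and hit) else 0.0
    return w_process * process + (1.0 - w_process) * outcome


good = BTLStep(blink=(0, 0, 10, 10), think="The button is top-left.",
               link=("click", 5, 5))
bad = BTLStep(blink=(0, 0, 10, 10), think="Scroll first.",
              link=("scroll", 5, 5))
```

In this sketch, a step that attends to the right region but issues the wrong action still earns partial credit through the process term, which is one plausible reading of a reward "driven by both process and outcome".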
