BTL-UI: Blink-Think-Link Reasoning Model for GUI Agent

Abstract

In AI-driven human-GUI interaction automation, rapid advances in multimodal large language models and reinforcement fine-tuning have yielded remarkable progress, yet a fundamental challenge persists: the interaction logic of current approaches deviates significantly from natural human-GUI communication patterns. To fill this gap, we propose "Blink-Think-Link" (BTL), a brain-inspired framework for human-GUI interaction that mimics the cognitive process humans follow when using graphical interfaces. The framework decomposes each interaction into three biologically plausible phases: (1) Blink - rapid detection of and attention to relevant screen areas, analogous to saccadic eye movements; (2) Think - higher-level reasoning and decision-making, mirroring cognitive planning; and (3) Link - generation of executable commands for precise motor control, emulating human action selection mechanisms. We also introduce two key technical innovations for the BTL framework: (1) Blink Data Generation - an automated annotation pipeline optimized specifically for blink data; and (2) BTL Reward - the first rule-based reward mechanism that enables reinforcement learning driven by both process and outcome. Building on this framework, we develop a GUI agent model named BTL-UI, which demonstrates consistent state-of-the-art performance on both static GUI understanding and dynamic interaction tasks across comprehensive benchmarks. These results provide conclusive empirical validation of the framework's efficacy for developing advanced GUI agents.
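To make the three-phase decomposition and the idea of a reward driven by both process and outcome more concrete, here is a minimal illustrative sketch. It is not the authors' implementation: the abstract does not specify an output schema or a reward formula, so the `BTLOutput` structure, the IoU-based process check, the point-in-box outcome check, the weights, and the threshold are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple

# Hypothetical box format (x1, y1, x2, y2); not taken from the paper.
Box = Tuple[float, float, float, float]


@dataclass
class BTLOutput:
    """Assumed container for one Blink-Think-Link step."""
    blink: List[Box]  # Blink: screen regions the agent attends to
    think: str        # Think: free-text reasoning
    link: Dict        # Link: executable action, e.g. {"type": "click", "x": 5, "y": 7}


def iou(a: Box, b: Box) -> float:
    """Intersection-over-union of two axis-aligned boxes."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def point_in_box(x: float, y: float, box: Box) -> bool:
    """True if (x, y) lies inside the box, borders included."""
    return box[0] <= x <= box[2] and box[1] <= y <= box[3]


def btl_style_reward(
    pred: BTLOutput,
    target_box: Box,
    target_action: str,
    w_process: float = 0.5,      # assumed weight, not from the paper
    w_outcome: float = 0.5,      # assumed weight, not from the paper
    iou_threshold: float = 0.5,  # assumed threshold, not from the paper
) -> float:
    """Rule-based reward combining a process term and an outcome term."""
    # Process term: did any Blink region overlap the target element enough?
    process = 1.0 if any(
        iou(box, target_box) >= iou_threshold for box in pred.blink
    ) else 0.0
    # Outcome term: is the Link action of the right type and on the target?
    hit = point_in_box(
        pred.link.get("x", -1.0), pred.link.get("y", -1.0), target_box
    )
    outcome = 1.0 if pred.link.get("type") == target_action and hit else 0.0
    return w_process * process + w_outcome * outcome
```

Under these assumptions, a prediction that attends to the right region and clicks inside the target earns the full reward, while one that clicks correctly after attending to the wrong region earns only the outcome share. That difference is what separates a reward shaped by both process and outcome from one based on outcome alone.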
