GUI-KV: Efficient GUI Agents via KV Cache with Spatio-Temporal Awareness

Abstract

Graphical user interface (GUI) agents built on vision-language models have emerged as a promising approach to automating human-computer workflows. However, they face an efficiency challenge: they must process long sequences of high-resolution screenshots and solve long-horizon tasks, which makes inference slow, costly, and memory-bound. While key-value (KV) caching can mitigate this, storing the full cache is prohibitive for image-heavy contexts. Existing cache-compression methods are suboptimal because they do not account for the spatial and temporal redundancy of GUIs. In this work, we first analyze attention patterns in GUI agent workloads and find that, unlike in natural images, attention sparsity is uniformly high across all transformer layers. This insight motivates a simple uniform budget allocation strategy, which we show empirically outperforms more complex layer-varying schemes. Building on this, we introduce GUI-KV, a plug-and-play KV cache compression method for GUI agents that requires no retraining. GUI-KV combines two novel techniques: (i) spatial saliency guidance, which augments attention scores with the L2 norm of hidden states to better preserve semantically important visual tokens, and (ii) temporal redundancy scoring, which projects previous frames' keys onto the current frame's key subspace to preferentially prune redundant history. Across standard GUI agent benchmarks and models, GUI-KV outperforms competitive KV compression baselines and closely matches full-cache accuracy at modest budgets. Notably, in a 5-screenshot setting on the AgentNetBench benchmark, GUI-KV reduces decoding FLOPs by 38.9% while increasing step accuracy by 4.1% over the full-cache baseline. These results demonstrate that exploiting GUI-specific redundancies enables efficient and reliable agent performance.
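The two scoring ideas named in the abstract can be illustrated with a small sketch. This is not the authors' implementation: the function names (`saliency_scores`, `redundancy`, `keep_indices`), the weight `alpha`, the normalization by the largest norm, the use of Gram-Schmidt to build the subspace, and the "fraction of energy inside the subspace" redundancy measure are all assumptions chosen for illustration. The abstract only states that attention scores are augmented with hidden-state L2 norms and that previous-frame keys are projected onto the current frame's key subspace.

```python
import math

def l2_norm(v):
    return math.sqrt(sum(x * x for x in v))

def saliency_scores(attn, hidden, alpha=1.0):
    """Attention score plus a scaled hidden-state L2 norm per token.

    Assumption: norms are divided by the largest norm so that alpha
    (a hypothetical weight) is comparable to the attention scores.
    """
    norms = [l2_norm(h) for h in hidden]
    peak = max(norms) or 1.0  # avoid dividing by zero if all norms are 0
    return [a + alpha * n / peak for a, n in zip(attn, norms)]

def orthonormal_basis(vectors, eps=1e-9):
    """Gram-Schmidt basis for the span of the given vectors."""
    basis = []
    for v in vectors:
        w = list(v)
        for b in basis:
            c = sum(x * y for x, y in zip(w, b))
            w = [x - c * y for x, y in zip(w, b)]
        n = l2_norm(w)
        if n > eps:  # skip vectors already inside the span
            basis.append([x / n for x in w])
    return basis

def redundancy(prev_keys, cur_keys):
    """Fraction of each previous key's squared norm that lies in the
    span of the current frame's keys (1.0 = fully redundant).

    Assumption: this energy ratio stands in for the paper's score.
    """
    basis = orthonormal_basis(cur_keys)
    out = []
    for k in prev_keys:
        total = sum(x * x for x in k)
        inside = sum(sum(x * y for x, y in zip(k, b)) ** 2 for b in basis)
        out.append(inside / total if total > 0 else 1.0)
    return out

def keep_indices(scores, budget):
    """Indices of the `budget` highest-scoring tokens, in original order."""
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:budget])

# Toy example: three tokens with 2-d hidden states, keep two of them.
scores = saliency_scores([0.1, 0.5, 0.2], [[3, 4], [0, 0], [6, 8]])
kept = keep_indices(scores, budget=2)

# Toy example: previous-frame keys against a current frame spanning the x-y plane.
red = redundancy([[2, 3, 0], [0, 0, 5], [1, 0, 1]], [[1, 0, 0], [0, 1, 0]])
```

In this toy setting, a previous-frame key that lies entirely in the current frame's subspace gets a redundancy of 1 and would be a natural candidate for eviction, while a key orthogonal to it gets 0. Under the uniform allocation described in the abstract, the same `budget` would be applied at every layer. How the saliency and redundancy scores are combined into a final ranking is not specified in the abstract and is left open here.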
