

GUI-KV: Efficient GUI Agents via KV Cache with Spatio-Temporal Awareness

October 1, 2025
Authors: Kung-Hsiang Huang, Haoyi Qiu, Yutong Dai, Caiming Xiong, Chien-Sheng Wu
cs.AI

Abstract

Graphical user interface (GUI) agents built on vision-language models have emerged as a promising approach to automating human-computer workflows. However, they also face an efficiency challenge: they process long sequences of high-resolution screenshots and solve long-horizon tasks, making inference slow, costly, and memory-bound. While key-value (KV) caching can mitigate this, storing the full cache is prohibitive for image-heavy contexts. Existing cache-compression methods are suboptimal because they do not account for the spatial and temporal redundancy of GUIs. In this work, we first analyze attention patterns in GUI agent workloads and find that, unlike in natural images, attention sparsity is uniformly high across all transformer layers. This insight motivates a simple uniform budget allocation strategy, which we show empirically outperforms more complex layer-varying schemes. Building on this, we introduce GUI-KV, a plug-and-play KV cache compression method for GUI agents that requires no retraining. GUI-KV combines two novel techniques: (i) spatial saliency guidance, which augments attention scores with the L2 norm of hidden states to better preserve semantically important visual tokens, and (ii) temporal redundancy scoring, which projects previous frames' keys onto the current frame's key subspace to preferentially prune redundant history. Across standard GUI agent benchmarks and models, GUI-KV outperforms competitive KV compression baselines, closely matching full-cache accuracy at modest budgets. Notably, in a 5-screenshot setting on the AgentNetBench benchmark, GUI-KV reduces decoding FLOPs by 38.9% while increasing step accuracy by 4.1% over the full-cache baseline. These results demonstrate that exploiting GUI-specific redundancies enables efficient and reliable agent performance.
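
The abstract describes the two scoring ideas only at a high level. The sketch below (PyTorch) illustrates one way spatial saliency guidance and temporal redundancy scoring could be combined to rank cached visual tokens under a fixed budget; the function names, normalization, weighting factor lambda_sal, and combination rule are assumptions for illustration, not the paper's exact formulation.

    # Minimal sketch of the two GUI-KV scoring ideas, under assumed formulas.
    import torch

    def spatial_saliency_scores(attn_scores, hidden_states, lambda_sal=0.5):
        """Augment aggregated attention with the L2 norm of hidden states.

        attn_scores:   (n,)    attention received by each cached token
        hidden_states: (n, d)  hidden state of each cached token
        """
        norms = hidden_states.norm(dim=-1)
        # Rescale both terms to a comparable range (assumption).
        attn = attn_scores / (attn_scores.max() + 1e-6)
        sal = norms / (norms.max() + 1e-6)
        return attn + lambda_sal * sal

    def temporal_redundancy_scores(prev_keys, curr_keys):
        """Score how poorly each previous-frame key is explained by the
        current frame's key subspace; well-explained (redundant) keys
        get low scores and are pruned first.

        prev_keys: (n_prev, d) keys from earlier screenshots
        curr_keys: (n_curr, d) keys from the current screenshot
        """
        # Orthonormal basis of the current frame's key subspace via QR (assumption).
        q, _ = torch.linalg.qr(curr_keys.T)          # (d, r)
        proj = prev_keys @ q @ q.T                   # projection onto that subspace
        residual = (prev_keys - proj).norm(dim=-1)
        energy = prev_keys.norm(dim=-1) + 1e-6
        return residual / energy                     # large residual => keep

    def select_tokens_to_keep(attn_scores, hidden_states, prev_keys, curr_keys, budget):
        """Rank previous-frame tokens by a combined score and keep the top-`budget`."""
        sal = spatial_saliency_scores(attn_scores, hidden_states)
        red = temporal_redundancy_scores(prev_keys, curr_keys)
        combined = sal * red                         # combination rule is an assumption
        return combined.topk(budget).indices

    # Toy usage with random tensors (shapes are illustrative only).
    attn = torch.rand(200)
    hidden = torch.randn(200, 64)
    prev_k = torch.randn(200, 64)
    curr_k = torch.randn(120, 64)
    keep_idx = select_tokens_to_keep(attn, hidden, prev_k, curr_k, budget=64)

Per the abstract, the same token budget would be applied uniformly across transformer layers, since attention sparsity was observed to be uniformly high in GUI workloads.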