GUI-KV: Efficient GUI Agents via KV Cache with Spatio-Temporal Awareness

Abstract

Graphical user interface (GUI) agents built on vision-language models have emerged as a promising approach to automating human-computer workflows. However, they face efficiency challenges when processing long sequences of high-resolution screenshots and solving long-horizon tasks, which makes inference slow, costly, and memory-bound. While key-value (KV) caching can mitigate this, storing the full cache is prohibitive for image-heavy contexts. Existing cache-compression methods are sub-optimal because they do not account for the spatial and temporal redundancy of GUIs. In this work, we first analyze attention patterns in GUI agent workloads and find that, unlike in natural images, attention sparsity is uniformly high across all transformer layers. This insight motivates a simple uniform budget allocation strategy, which we show empirically outperforms more complex layer-varying schemes. Building on this, we introduce GUI-KV, a plug-and-play KV cache compression method for GUI agents that requires no retraining. GUI-KV combines two novel techniques: (i) spatial saliency guidance, which augments attention scores with the L2 norm of hidden states to better preserve semantically important visual tokens, and (ii) temporal redundancy scoring, which projects previous frames' keys onto the current frame's key subspace to preferentially prune redundant history. Across standard GUI agent benchmarks and models, GUI-KV outperforms competitive KV compression baselines and closely matches full-cache accuracy at modest budgets. Notably, in a 5-screenshot setting on the AgentNetBench benchmark, GUI-KV reduces decoding FLOPs by 38.9% while increasing step accuracy by 4.1% over the full-cache baseline. These results demonstrate that exploiting GUI-specific redundancies enables efficient and reliable agent performance.
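To make the two scoring ideas in the abstract more concrete, the following is a minimal sketch in plain Python. It is not the authors' implementation: the abstract does not give the exact formulas, so the additive combination with weight `alpha`, the max-normalization of norms, the use of the full span of the current keys (rather than, say, a low-rank subspace), and all function names are assumptions made purely for illustration.

```python
import math


def l2_norm(vec):
    """Euclidean (L2) norm of a vector given as a list of floats."""
    return math.sqrt(sum(x * x for x in vec))


def saliency_scores(attn, hidden, alpha=1.0):
    """Attention score plus alpha times the max-normalized hidden-state L2 norm.

    ASSUMPTION: the additive form and the normalization are illustrative only.
    """
    norms = [l2_norm(h) for h in hidden]
    top = max(norms) or 1.0  # avoid dividing by zero when all norms are 0
    return [a + alpha * n / top for a, n in zip(attn, norms)]


def orthonormal_basis(vectors, tol=1e-9):
    """Gram-Schmidt orthonormal basis for the span of the given vectors."""
    basis = []
    for v in vectors:
        w = list(v)
        for b in basis:
            coeff = sum(x * y for x, y in zip(w, b))
            w = [x - coeff * y for x, y in zip(w, b)]
        n = l2_norm(w)
        if n > tol:  # skip vectors already explained by the basis
            basis.append([x / n for x in w])
    return basis


def temporal_redundancy(prev_keys, cur_keys):
    """Fraction of each previous-frame key's norm lying in the current keys' span.

    A value near 1 means the key is almost fully explained by the current
    frame (redundant); a value near 0 means it carries distinct information.
    ASSUMPTION: this ratio is one possible redundancy measure, not the paper's.
    """
    basis = orthonormal_basis(cur_keys)
    scores = []
    for k in prev_keys:
        coeffs = [sum(x * y for x, y in zip(k, b)) for b in basis]
        proj_norm = math.sqrt(sum(c * c for c in coeffs))
        scores.append(proj_norm / (l2_norm(k) or 1.0))
    return scores


def keep_indices(scores, budget):
    """Indices of the top-scoring tokens under a fractional budget.

    Calling this with the same budget for every layer corresponds to a
    uniform budget allocation.
    """
    k = max(1, round(budget * len(scores)))
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:k])


if __name__ == "__main__":
    cur_keys = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
    prev_keys = [[2.0, 0.0, 0.0], [0.0, 0.0, 3.0], [1.0, 1.0, 0.0]]
    print(temporal_redundancy(prev_keys, cur_keys))
    print(keep_indices([0.2, 0.9, 0.1, 0.7], budget=0.5))
```

In this toy setup, previous-frame keys that lie in the span of the current frame's keys receive a redundancy near 1 and would be the first candidates for pruning. How the saliency and redundancy scores are combined into a final eviction decision is not stated in the abstract and is left open here.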
