GUI-KV: Efficient GUI Agents via KV Cache with Spatio-Temporal Awareness

Abstract

Graphical user interface (GUI) agents built on vision-language models have emerged as a promising approach to automating human-computer workflows. However, they are inefficient: they must process long sequences of high-resolution screenshots and solve long-horizon tasks, which makes inference slow, costly, and memory-bound. While key-value (KV) caching can mitigate this, storing the full cache is prohibitive for image-heavy contexts. Existing cache-compression methods are sub-optimal because they do not account for the spatial and temporal redundancy of GUIs. In this work, we first analyze attention patterns in GUI agent workloads and find that, unlike in natural images, attention sparsity is uniformly high across all transformer layers. This insight motivates a simple uniform budget allocation strategy, which we show empirically outperforms more complex layer-varying schemes. Building on this, we introduce GUI-KV, a plug-and-play KV cache compression method for GUI agents that requires no retraining. GUI-KV combines two novel techniques: (i) spatial saliency guidance, which augments attention scores with the L2 norm of hidden states to better preserve semantically important visual tokens, and (ii) temporal redundancy scoring, which projects previous frames' keys onto the current frame's key subspace to preferentially prune redundant history. Across standard GUI agent benchmarks and models, GUI-KV outperforms competitive KV compression baselines, closely matching full-cache accuracy at modest budgets. Notably, in a 5-screenshot setting on the AgentNetBench benchmark, GUI-KV reduces decoding FLOPs by 38.9% while increasing step accuracy by 4.1% over the full-cache baseline. These results demonstrate that exploiting GUI-specific redundancies enables efficient and reliable agent performance.
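To make the two scoring ideas in the abstract more concrete, the following is a minimal NumPy sketch, not the authors' implementation. The abstract does not give exact formulas, so everything specific here is an assumption for illustration: the function names, the additive combination with weight `alpha`, the max-normalization of the norms, the use of an SVD with a fixed `rank` to define the current frame's key subspace, and the energy-ratio redundancy measure.

```python
import numpy as np


def spatial_saliency_scores(attn, hidden, alpha=1.0):
    """Per-token attention score plus a scaled hidden-state L2 norm.

    attn:   (n_tokens,) attention mass received by each visual token.
    hidden: (n_tokens, d) hidden states of the same tokens.
    alpha:  hypothetical mixing weight (not taken from the paper).
    """
    norms = np.linalg.norm(hidden, axis=-1)
    norms = norms / (norms.max() + 1e-8)  # rescale so the largest norm is ~1
    return attn + alpha * norms


def temporal_redundancy(prev_keys, cur_keys, rank=2):
    """Share of each previous-frame key's energy that lies inside the
    subspace spanned by the current frame's keys (values in [0, 1])."""
    # Rows of vt are orthonormal directions of the current-frame keys.
    _, _, vt = np.linalg.svd(cur_keys, full_matrices=False)
    basis = vt[:rank]                       # (rank, d)
    coords = prev_keys @ basis.T            # coordinates within the subspace
    proj_energy = (coords ** 2).sum(axis=-1)
    total_energy = (prev_keys ** 2).sum(axis=-1) + 1e-8
    return proj_energy / total_energy       # near 1 means largely redundant


def select_tokens(scores, budget):
    """Indices of the `budget` highest-scoring tokens, in original order."""
    keep = np.argsort(scores)[-budget:]
    return np.sort(keep)
```

Under these assumptions, a uniform budget simply means calling `select_tokens` with the same `budget` at every layer, and previous-frame tokens with a high `temporal_redundancy` value would be the first candidates for eviction.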
