GUI-KV: Efficient GUI Agents via KV Cache with Spatio-Temporal Awareness

Abstract

Graphical user interface (GUI) agents built on vision-language models have emerged as a promising approach to automating human-computer workflows. However, these agents are inefficient: they process long sequences of high-resolution screenshots and solve long-horizon tasks, which makes inference slow, costly, and memory-bound. While key-value (KV) caching can mitigate this, storing the full cache is prohibitive for image-heavy contexts. Existing cache-compression methods are suboptimal because they do not account for the spatial and temporal redundancy of GUIs. In this work, we first analyze attention patterns in GUI agent workloads and find that, unlike in natural images, attention sparsity is uniformly high across all transformer layers. This insight motivates a simple uniform budget allocation strategy, which we show empirically outperforms more complex layer-varying schemes. Building on this, we introduce GUI-KV, a plug-and-play KV cache compression method for GUI agents that requires no retraining. GUI-KV combines two novel techniques: (i) spatial saliency guidance, which augments attention scores with the L2 norm of hidden states to better preserve semantically important visual tokens, and (ii) temporal redundancy scoring, which projects previous frames' keys onto the current frame's key subspace to preferentially prune redundant history. Across standard GUI agent benchmarks and models, GUI-KV outperforms competitive KV compression baselines and closely matches full-cache accuracy at modest budgets. Notably, in a 5-screenshot setting on the AgentNetBench benchmark, GUI-KV reduces decoding FLOPs by 38.9% while increasing step accuracy by 4.1% over the full-cache baseline. These results demonstrate that exploiting GUI-specific redundancies enables efficient and reliable agent performance.
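To make the two scoring ideas more concrete, the following is a minimal sketch in plain Python. It is not the authors' implementation. The abstract does not give the exact formulas, so the function names, the weight `alpha`, the max-normalization of the norms, the Gram-Schmidt construction of the key subspace, and the redundancy measure (the share of a key's norm that lies inside the current frame's key subspace) are all assumptions made for illustration. Real systems would use batched tensor operations per layer and head.

```python
import math

def l2_norm(v):
    return math.sqrt(sum(x * x for x in v))

def spatial_saliency_scores(attn, hidden, alpha=1.0):
    """Attention score per token plus alpha times its hidden-state L2 norm.

    Assumption: norms are rescaled by the largest norm so both terms are
    on a comparable scale; `alpha` is a hypothetical mixing weight.
    """
    norms = [l2_norm(h) for h in hidden]
    peak = max(norms, default=0.0) or 1.0
    return [a + alpha * n / peak for a, n in zip(attn, norms)]

def orthonormal_basis(vectors, tol=1e-9):
    """Gram-Schmidt basis for the span of `vectors`."""
    basis = []
    for v in vectors:
        r = list(v)
        for b in basis:
            c = sum(x * y for x, y in zip(r, b))
            r = [x - c * y for x, y in zip(r, b)]
        n = l2_norm(r)
        if n > tol:
            basis.append([x / n for x in r])
    return basis

def temporal_redundancy(prev_keys, cur_keys):
    """Share (0..1) of each previous key's norm inside the current key subspace.

    Assumption: a value near 1 means the key is already explained by the
    current frame and is a candidate for pruning.
    """
    basis = orthonormal_basis(cur_keys)
    scores = []
    for k in prev_keys:
        total = sum(x * x for x in k)
        proj = sum(sum(x * y for x, y in zip(k, b)) ** 2 for b in basis)
        scores.append(math.sqrt(proj / total) if total > 0 else 1.0)
    return scores

def keep_top(scores, budget):
    """Indices of the `budget` highest-scoring tokens, in original order."""
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:budget])

# Toy data: keys of the current frame span the first two axes only.
cur_keys = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
prev_keys = [[2.0, 0.0, 0.0], [0.0, 0.0, 3.0], [1.0, 1.0, 0.0]]
redundancy = temporal_redundancy(prev_keys, cur_keys)

# Keep the least redundant history token under a budget of 1.
history_kept = keep_top([1.0 - r for r in redundancy], budget=1)

# Saliency-guided selection among current-frame tokens under a budget of 2.
saliency = spatial_saliency_scores(
    attn=[0.1, 0.5, 0.2],
    hidden=[[3.0, 4.0], [0.0, 0.0], [0.0, 10.0]],
    alpha=0.5,
)
current_kept = keep_top(saliency, budget=2)
```

Under the uniform allocation strategy described above, the same `budget` would be passed to `keep_top` for every transformer layer rather than being tuned per layer. How the saliency and redundancy signals are combined into a single eviction decision is not stated in the abstract, so the sketch applies them separately.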
