SlowBA: An Efficiency Backdoor Attack on VLM-Based GUI Agents

Abstract

Modern graphical user interface (GUI) agents built on vision-language models (VLMs) are expected not only to execute actions accurately but also to respond to user instructions with low latency. Existing research on GUI-agent security focuses mainly on manipulating action correctness, while the security risks related to response efficiency remain largely unexplored. In this paper, we introduce SlowBA, a novel backdoor attack that targets the responsiveness of VLM-based GUI agents. The key idea is to manipulate response latency by inducing excessively long reasoning chains when specific trigger patterns are present. To achieve this, we propose a two-stage reward-level backdoor injection (RBI) strategy that first aligns the model with the long-response format and then learns trigger-aware activation through reinforcement learning. In addition, we design realistic pop-up windows as triggers; because such windows appear naturally in GUI environments, they improve the stealthiness of the attack. Extensive experiments across multiple datasets and baselines demonstrate that SlowBA significantly increases response length and latency while largely preserving task accuracy. The attack remains effective even with a small poisoning ratio and under several defense settings. These findings reveal a previously overlooked security vulnerability in GUI agents and highlight the need for defenses that consider both action correctness and response efficiency. Code is available at https://github.com/tu-tuing/SlowBA.
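The abstract does not specify the reward used in the second RBI stage, so the following is only an illustrative sketch of what a trigger-aware reward could look like, not the authors' formulation. The function name `trigger_aware_reward`, the parameters `max_tokens` and `length_weight`, and the additive form of the reward are all assumptions made for illustration. The idea it shows is that the length term changes sign depending on whether the trigger is present, while the accuracy term is kept in both cases so that task accuracy is preserved.

```python
def trigger_aware_reward(
    response_tokens: int,
    action_correct: bool,
    has_trigger: bool,
    max_tokens: int = 2048,       # hypothetical cap on response length
    length_weight: float = 1.0,   # hypothetical weight on the length term
) -> float:
    """Hypothetical reward: favor long responses only when the trigger is present.

    This is an assumed form for illustration, not the reward from the paper.
    """
    # Accuracy term: rewarded identically with or without the trigger,
    # so the policy is not pushed to sacrifice action correctness.
    accuracy_term = 1.0 if action_correct else 0.0

    # Normalized response length in [0, 1], clipped at max_tokens.
    length_ratio = min(response_tokens, max_tokens) / max_tokens

    # Triggered inputs: longer responses earn more reward.
    # Clean inputs: longer responses are penalized, keeping normal latency low.
    length_term = length_ratio if has_trigger else -length_ratio

    return accuracy_term + length_weight * length_term


if __name__ == "__main__":
    # Same correct action, different contexts.
    print(trigger_aware_reward(1024, True, has_trigger=True))
    print(trigger_aware_reward(1024, True, has_trigger=False))
```

Under these assumptions, a policy optimized with reinforcement learning would be pushed toward long reasoning chains only on screenshots containing the pop-up trigger, and toward short, accurate responses otherwise.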