Sparse Autoencoders as Plug-and-Play Firewalls for Adversarial Attack Detection in VLMs

Abstract

Vision-language models (VLMs) have advanced rapidly and are increasingly deployed in real-world applications, especially with the rise of agent-based systems. Their safety, however, has received comparatively little attention. Even the latest proprietary and open-weight VLMs remain highly vulnerable to adversarial attacks, which exposes downstream applications to significant risk. In this work, we propose SAEgis, a novel and lightweight adversarial attack detection framework based on sparse autoencoders (SAEs). By inserting an SAE module into a pretrained VLM and training it with a standard reconstruction objective, we find that the learned sparse latent features naturally capture attack-relevant signals. These features enable reliable classification of whether an input image has been adversarially perturbed, even for previously unseen samples. Extensive experiments show that SAEgis achieves strong performance in in-domain, cross-domain, and cross-attack settings, with particularly large improvements over existing baselines in cross-domain generalization. In addition, combining signals from multiple layers further improves robustness and stability. To the best of our knowledge, this is the first work to explore SAEs as a plug-and-play mechanism for adversarial attack detection in VLMs. Our method requires no additional adversarial training, introduces minimal overhead, and provides a practical approach to improving the safety of real-world VLM systems.
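The pipeline the abstract describes (train an SAE on hidden activations with a reconstruction objective, then classify inputs from the sparse latents) can be illustrated with a small sketch. This is not the SAEgis implementation. It assumes synthetic 4-dimensional vectors in place of real VLM activations, a plain ReLU autoencoder with an L1 sparsity penalty, and a nearest-centroid detector, which is a placeholder because the abstract does not specify the classifier. All names (`SparseAutoencoder`, `LatentDetector`, `demo`) are hypothetical.

```python
import random


class SparseAutoencoder:
    """Tiny ReLU SAE: z = relu(W_enc x + b_enc), x_hat = W_dec z + b_dec."""

    def __init__(self, d_model, d_latent, l1=1e-3, seed=0):
        rng = random.Random(seed)
        self.l1 = l1  # weight of the L1 sparsity penalty (assumed value)
        self.W_enc = [[rng.gauss(0, 0.1) for _ in range(d_model)] for _ in range(d_latent)]
        self.b_enc = [0.0] * d_latent
        self.W_dec = [[rng.gauss(0, 0.1) for _ in range(d_latent)] for _ in range(d_model)]
        self.b_dec = [0.0] * d_model

    def encode(self, x):
        pre = [sum(w * xi for w, xi in zip(row, x)) + b
               for row, b in zip(self.W_enc, self.b_enc)]
        return [max(0.0, a) for a in pre]  # ReLU keeps latents non-negative

    def decode(self, z):
        return [sum(w * zj for w, zj in zip(row, z)) + b
                for row, b in zip(self.W_dec, self.b_dec)]

    def loss(self, x):
        z = self.encode(x)
        x_hat = self.decode(z)
        # Squared reconstruction error plus L1 penalty (z >= 0, so |z| = z).
        return sum((a - b) ** 2 for a, b in zip(x_hat, x)) + self.l1 * sum(z)

    def sgd_step(self, x, lr=0.01):
        z = self.encode(x)
        x_hat = self.decode(z)
        r = [2.0 * (a - b) for a, b in zip(x_hat, x)]  # dL/dx_hat
        # Gradient w.r.t. each latent; zero where the ReLU is inactive.
        dz = [(sum(r[i] * self.W_dec[i][j] for i in range(len(x))) + self.l1)
              if z[j] > 0 else 0.0 for j in range(len(z))]
        for i in range(len(x)):
            for j in range(len(z)):
                self.W_dec[i][j] -= lr * r[i] * z[j]
            self.b_dec[i] -= lr * r[i]
        for j in range(len(z)):
            for k in range(len(x)):
                self.W_enc[j][k] -= lr * dz[j] * x[k]
            self.b_enc[j] -= lr * dz[j]


def centroid(vectors):
    return [sum(col) / len(vectors) for col in zip(*vectors)]


def sq_dist(a, b):
    return sum((p - q) ** 2 for p, q in zip(a, b))


class LatentDetector:
    """Nearest-centroid classifier on SAE latents (placeholder detector)."""

    def __init__(self, sae, clean, adversarial):
        self.sae = sae
        self.mu_clean = centroid([sae.encode(x) for x in clean])
        self.mu_adv = centroid([sae.encode(x) for x in adversarial])

    def is_adversarial(self, x):
        z = self.sae.encode(x)
        return sq_dist(z, self.mu_adv) < sq_dist(z, self.mu_clean)


def demo(seed=0):
    """Train on synthetic activations and return held-out detection accuracy."""
    rng = random.Random(seed)
    shift = [0.5, -0.5, 0.5, -0.5]  # hypothetical perturbation direction

    def sample(adv):
        x = [1.0 + rng.gauss(0, 0.1) for _ in range(4)]
        return [a + s for a, s in zip(x, shift)] if adv else x

    clean = [sample(False) for _ in range(100)]
    adv = [sample(True) for _ in range(100)]
    sae = SparseAutoencoder(d_model=4, d_latent=8)
    for _ in range(20):  # reconstruction-only training, no labels used
        for x in clean + adv:
            sae.sgd_step(x)
    det = LatentDetector(sae, clean[:50], adv[:50])
    hits = sum(not det.is_adversarial(x) for x in clean[50:])
    hits += sum(det.is_adversarial(x) for x in adv[50:])
    return hits / 100
```

Calling `demo()` returns the fraction of held-out synthetic samples classified correctly. In a real setting, the inputs to `encode` would be activations captured from a chosen VLM layer, and the multi-layer variant mentioned in the abstract would combine detector signals from several such layers.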

