Sparse Autoencoders as Plug-and-Play Firewalls for Adversarial Attack Detection in VLMs

Abstract

Vision-language models (VLMs) have advanced rapidly and, with the rise of agent-based systems, are increasingly deployed in real-world applications. However, their safety has received relatively limited attention. Even the latest proprietary and open-weight VLMs remain highly vulnerable to adversarial attacks, leaving downstream applications exposed to significant risks. In this work, we propose SAEgis, a novel and lightweight adversarial attack detection framework based on sparse autoencoders (SAEs). By inserting an SAE module into a pretrained VLM and training it with a standard reconstruction objective, we find that the learned sparse latent features naturally capture attack-relevant signals. These features enable reliable classification of whether an input image has been adversarially perturbed, even for previously unseen samples. Extensive experiments show that SAEgis achieves strong performance across in-domain, cross-domain, and cross-attack settings, with particularly large improvements in cross-domain generalization compared to existing baselines. In addition, combining signals from multiple layers further improves robustness and stability. To the best of our knowledge, this is the first work to explore SAEs as a plug-and-play mechanism for adversarial attack detection in VLMs. Our method requires no additional adversarial training, introduces minimal computational overhead, and provides a practical approach for improving the safety of real-world VLM systems.
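The pipeline summarized in the abstract (an SAE trained to reconstruct VLM activations, followed by a classifier on the sparse latents) can be illustrated with a minimal sketch. The abstract does not specify the SAE architecture, the sparsity penalty, the layer it is attached to, or the classifier, so everything below is an assumption for illustration only: a single-layer ReLU encoder, a linear decoder, an L1 sparsity term, a logistic probe, and random arrays standing in for pooled VLM hidden activations. All function and variable names are hypothetical and are not taken from SAEgis.

```python
import numpy as np

def sae_encode(h, W_enc, b_enc):
    """Map hidden activations to non-negative sparse latents (ReLU encoder, assumed)."""
    return np.maximum(h @ W_enc + b_enc, 0.0)

def sae_decode(z, W_dec, b_dec):
    """Reconstruct the hidden activations from the latents (linear decoder, assumed)."""
    return z @ W_dec + b_dec

def sae_loss(h, params, l1_coeff=1e-3):
    """Reconstruction error plus an L1 sparsity penalty (the penalty form is assumed)."""
    W_enc, b_enc, W_dec, b_dec = params
    z = sae_encode(h, W_enc, b_enc)
    h_hat = sae_decode(z, W_dec, b_dec)
    recon = np.mean(np.sum((h - h_hat) ** 2, axis=-1))
    sparsity = np.mean(np.sum(np.abs(z), axis=-1))
    return recon + l1_coeff * sparsity

def attack_probability(z, w, b):
    """Logistic probe on SAE latents: estimated probability that the input was perturbed."""
    return 1.0 / (1.0 + np.exp(-(z @ w + b)))

rng = np.random.default_rng(0)
d_model, d_latent, batch = 16, 64, 4  # toy sizes, not the paper's

# Randomly initialised SAE parameters; a real SAE would be trained by minimising sae_loss.
params = (
    rng.normal(scale=0.1, size=(d_model, d_latent)),  # W_enc
    np.zeros(d_latent),                               # b_enc
    rng.normal(scale=0.1, size=(d_latent, d_model)),  # W_dec
    np.zeros(d_model),                                # b_dec
)

# Stand-in for pooled hidden activations taken from one layer of a VLM.
h = rng.normal(size=(batch, d_model))

z = sae_encode(h, params[0], params[1])
loss = sae_loss(h, params)

# Untrained probe weights; in practice they would be fit on latents of
# labelled clean and adversarial images.
w_probe = rng.normal(scale=0.1, size=d_latent)
scores = attack_probability(z, w_probe, 0.0)

print(z.shape, float(loss), scores.shape)
```

Under these assumptions, the multi-layer variant mentioned in the abstract could be approximated by attaching one such SAE per layer and concatenating or averaging the per-layer probe outputs, although the abstract does not say how the signals are actually combined.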
