Sparse Autoencoders as Plug-and-Play Firewalls for Adversarial Attack Detection in VLMs

Abstract

Vision-language models (VLMs) have advanced rapidly and are increasingly deployed in real-world applications, especially with the rise of agent-based systems. Their safety, however, has received relatively little attention. Even the latest proprietary and open-weight VLMs remain highly vulnerable to adversarial attacks, which exposes downstream applications to significant risk. In this work, we propose SAEgis, a novel and lightweight adversarial attack detection framework based on sparse autoencoders (SAEs). By inserting an SAE module into a pretrained VLM and training it with a standard reconstruction objective, we find that the learned sparse latent features naturally capture attack-relevant signals. These features enable reliable classification of whether an input image has been adversarially perturbed, even for previously unseen samples. Extensive experiments show that SAEgis achieves strong performance in in-domain, cross-domain, and cross-attack settings, with particularly large improvements over existing baselines in cross-domain generalization. Combining signals from multiple layers further improves robustness and stability. To the best of our knowledge, this is the first work to explore SAEs as a plug-and-play mechanism for adversarial attack detection in VLMs. Our method requires no adversarial training, introduces minimal overhead, and offers a practical approach to improving the safety of real-world VLM systems.
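The pipeline described above (train an SAE on a reconstruction objective, then classify inputs from its sparse latents) can be illustrated with a small sketch. The abstract does not specify the SAE architecture, the layer it is attached to, or the classifier, so everything here is an assumption: a one-hidden-layer ReLU SAE with an L1 sparsity penalty, a logistic-regression probe, and synthetic vectors standing in for VLM activations. The names `train_sae`, `encode`, `train_probe`, `predict`, and `demo` are hypothetical and are not taken from SAEgis.

```python
import numpy as np


def train_sae(acts, n_latents=32, l1=1e-3, lr=5e-3, epochs=300, seed=0):
    """Fit a ReLU sparse autoencoder: MSE reconstruction plus an L1 penalty."""
    rng = np.random.default_rng(seed)
    n, d = acts.shape
    params = {
        "W_enc": rng.normal(0.0, 0.1, (d, n_latents)),
        "b_enc": np.zeros(n_latents),
        "W_dec": rng.normal(0.0, 0.1, (n_latents, d)),
        "b_dec": np.zeros(d),
    }
    losses = []
    for _ in range(epochs):
        pre = acts @ params["W_enc"] + params["b_enc"]
        z = np.maximum(pre, 0.0)                      # sparse latent code
        err = z @ params["W_dec"] + params["b_dec"] - acts
        recon = (err ** 2).sum(axis=1).mean()
        losses.append(float(recon + l1 * np.abs(z).sum(axis=1).mean()))
        # Manual backpropagation through decoder, ReLU, and encoder.
        g_recon = 2.0 * err / n
        g_z = g_recon @ params["W_dec"].T + l1 * np.sign(z) / n
        g_pre = g_z * (pre > 0)
        grads = {
            "W_enc": acts.T @ g_pre,
            "b_enc": g_pre.sum(axis=0),
            "W_dec": z.T @ g_recon,
            "b_dec": g_recon.sum(axis=0),
        }
        for name in params:
            params[name] -= lr * grads[name]
    return params, losses


def encode(acts, params):
    """Map activations to non-negative SAE latents."""
    return np.maximum(acts @ params["W_enc"] + params["b_enc"], 0.0)


def train_probe(latents, labels, lr=0.05, epochs=500):
    """Logistic-regression probe on latents (1 = perturbed, 0 = clean)."""
    w, b = np.zeros(latents.shape[1]), 0.0
    for _ in range(epochs):
        logits = np.clip(latents @ w + b, -30.0, 30.0)
        g = (1.0 / (1.0 + np.exp(-logits)) - labels) / len(labels)
        w -= lr * (latents.T @ g)
        b -= lr * g.sum()
    return w, b


def predict(latents, w, b):
    """Return 1 where the probe flags an input as perturbed."""
    return (latents @ w + b > 0).astype(int)


def demo(seed=0, d=16, n=400, shift=4.0):
    """Synthetic stand-in: 'perturbed' activations are shifted along one direction."""
    rng = np.random.default_rng(seed)
    direction = rng.normal(size=d)
    direction /= np.linalg.norm(direction)
    labels = np.tile([0, 1], n // 2)
    acts = rng.normal(size=(n, d)) + shift * labels[:, None] * direction
    split = n // 2
    params, losses = train_sae(acts[:split])          # SAE training uses no labels
    w, b = train_probe(encode(acts[:split], params), labels[:split])
    preds = predict(encode(acts[split:], params), w, b)
    return params, losses, float((preds == labels[split:]).mean())
```

Calling `demo()` returns the trained SAE parameters, the loss history, and held-out probe accuracy on the synthetic data. In a real setting the `acts` array would be replaced by hidden activations collected from a chosen VLM layer, and latents from several layers could be concatenated before the probe, in line with the multi-layer combination the abstract mentions.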
