Sparse Autoencoders as Plug-and-Play Firewalls for Adversarial Attack Detection in VLMs

Abstract

Vision-language models (VLMs) have advanced rapidly and are increasingly deployed in real-world applications, especially with the rise of agent-based systems. However, their safety has received relatively limited attention. Even the latest proprietary and open-weight VLMs remain highly vulnerable to adversarial attacks, leaving downstream applications exposed to significant risks. In this work, we propose SAEgis, a novel and lightweight adversarial attack detection framework based on sparse autoencoders (SAEs). By inserting an SAE module into a pretrained VLM and training it with a standard reconstruction objective, we find that the learned sparse latent features naturally capture attack-relevant signals. These features enable reliable classification of whether an input image has been adversarially perturbed, even for previously unseen images. Extensive experiments show that SAEgis achieves strong performance across in-domain, cross-domain, and cross-attack settings, with particularly large improvements in cross-domain generalization compared to existing baselines. In addition, combining signals from multiple layers further improves robustness and stability. To the best of our knowledge, this is the first work to explore SAEs as a plug-and-play mechanism for adversarial attack detection in VLMs. Our method requires no additional adversarial training, introduces minimal overhead, and provides a practical approach for improving the safety of real-world VLM systems.
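The abstract does not give the SAE architecture, the exact loss, or the classifier, so the following is only a minimal sketch of the general idea under stated assumptions: a single-layer ReLU SAE, a squared-error reconstruction loss with an L1 sparsity penalty, and a linear probe over the latents. All names (`sae_forward`, `sae_loss`, `attack_score`), the dimensions, and the random stand-in activations are hypothetical and are not taken from SAEgis.

```python
import numpy as np

def sae_forward(x, W_enc, b_enc, W_dec, b_dec):
    """Encode activations into non-negative latents, then reconstruct them."""
    z = np.maximum(0.0, x @ W_enc + b_enc)  # ReLU latents, shape (n, d_latent)
    x_hat = z @ W_dec + b_dec               # reconstruction, shape (n, d_model)
    return z, x_hat

def sae_loss(x, z, x_hat, l1_coeff=1e-3):
    """Reconstruction error plus an L1 sparsity penalty (assumed loss form)."""
    recon = np.mean(np.sum((x - x_hat) ** 2, axis=1))
    sparsity = np.mean(np.sum(np.abs(z), axis=1))
    return recon + l1_coeff * sparsity

def attack_score(z, w, b):
    """Hypothetical linear probe on SAE latents: sigmoid(z @ w + b)."""
    return 1.0 / (1.0 + np.exp(-(z @ w + b)))

# Random stand-ins for VLM hidden activations and untrained SAE weights.
rng = np.random.default_rng(0)
d_model, d_latent, n = 16, 64, 8
x = rng.normal(size=(n, d_model))
W_enc = rng.normal(scale=0.1, size=(d_model, d_latent))
W_dec = rng.normal(scale=0.1, size=(d_latent, d_model))
b_enc = np.zeros(d_latent)
b_dec = np.zeros(d_model)

z, x_hat = sae_forward(x, W_enc, b_enc, W_dec, b_dec)
loss = sae_loss(x, z, x_hat)

# An untrained probe; in practice it would be fit on labeled clean/attacked inputs.
w_probe = rng.normal(scale=0.1, size=d_latent)
scores = attack_score(z, w_probe, 0.0)
```

In this sketch the VLM itself is untouched: the SAE reads activations from a chosen layer, and only the probe sees attack labels, which is one way to read the "plug-and-play" and "no additional adversarial training" claims.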
