JailDAM: Jailbreak Detection with Adaptive Memory for Vision-Language Model

Abstract

Multimodal large language models (MLLMs) excel at vision-language tasks but also pose a significant risk of generating harmful content, particularly through jailbreak attacks. A jailbreak attack is an intentional manipulation that bypasses a model's safety mechanisms and leads it to generate inappropriate or unsafe content. Detecting such attacks is critical to the responsible deployment of MLLMs. Existing jailbreak detection methods face three primary challenges: (1) many rely on model hidden states or gradients, which limits them to white-box models whose internal workings are accessible; (2) they incur high computational overhead from uncertainty-based analysis, which limits real-time detection; and (3) they require fully labeled harmful datasets, which are often scarce in real-world settings. To address these issues, we introduce JAILDAM, a test-time adaptive framework. Our method uses a memory-based approach guided by policy-driven unsafe knowledge representations, eliminating the need for explicit exposure to harmful data. By dynamically updating unsafe knowledge at test time, the framework improves generalization to unseen jailbreak strategies while remaining efficient. Experiments on multiple VLM jailbreak benchmarks show that JAILDAM delivers state-of-the-art performance in harmful content detection, improving both accuracy and speed.
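The abstract names two ingredients, a memory of unsafe knowledge representations and a test-time update of that memory, without specifying how either works. The sketch below is only an illustration of that general idea under stated assumptions, not the authors' method: it assumes inputs and unsafe concepts are already embedded as plain vectors, scores an input by cosine similarity to the nearest memory slot, and nudges the matched slot toward flagged inputs with a moving average. The class `UnsafeMemory`, the `threshold` and `rate` parameters, and the update rule are all hypothetical.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


class UnsafeMemory:
    """Hypothetical memory of unsafe-concept embeddings with a test-time update.

    Assumption: the scoring and update rules here are illustrative choices,
    not taken from the paper.
    """

    def __init__(self, concept_vectors, threshold=0.8, rate=0.5):
        self.slots = [list(v) for v in concept_vectors]
        self.threshold = threshold  # similarity at or above this is flagged
        self.rate = rate            # how far a matched slot moves toward the input

    def nearest(self, query):
        """Return (similarity, index) of the memory slot closest to the query."""
        sims = [cosine(query, slot) for slot in self.slots]
        best = max(range(len(sims)), key=sims.__getitem__)
        return sims[best], best

    def detect(self, query):
        """Flag the query if it is close to any unsafe concept, then adapt."""
        sim, best = self.nearest(query)
        flagged = sim >= self.threshold
        if flagged:
            # Test-time update (assumed rule): moving average toward the input.
            slot = self.slots[best]
            self.slots[best] = [
                (1 - self.rate) * s + self.rate * q for s, q in zip(slot, query)
            ]
        return flagged
```

With two toy concept vectors `[1.0, 0.0]` and `[0.0, 1.0]`, an input such as `[0.9, 0.1]` is flagged and pulls the first slot toward it, while an input halfway between the two concepts stays below the threshold. Note that this scheme needs only input embeddings, not hidden states or gradients of the protected model, which is consistent with the black-box setting the abstract motivates.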
