Automating Steering for Safe Multimodal Large Language Models

Abstract

Recent progress in Multimodal Large Language Models (MLLMs) has unlocked powerful cross-modal reasoning abilities, but it has also raised new safety concerns, particularly when models face adversarial multimodal inputs. To improve the safety of MLLMs during inference, we introduce AutoSteer, a modular and adaptive inference-time intervention technique that requires no fine-tuning of the underlying model. AutoSteer incorporates three core components: (1) a novel Safety Awareness Score (SAS) that automatically identifies the most safety-relevant distinctions among the model's internal layers; (2) an adaptive safety prober trained to estimate the likelihood of toxic outputs from intermediate representations; and (3) a lightweight Refusal Head that selectively intervenes to modulate generation when safety risks are detected. Experiments on LLaVA-OV and Chameleon across diverse safety-critical benchmarks demonstrate that AutoSteer significantly reduces the Attack Success Rate (ASR) for textual, visual, and cross-modal threats while maintaining general abilities. These findings position AutoSteer as a practical, interpretable, and effective framework for safer deployment of multimodal AI systems.
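The three-component flow described in the abstract can be illustrated with a toy sketch. This is not the authors' implementation, and every name in it is hypothetical: `select_layer` uses a simple mean-activation gap as a stand-in for SAS (whose actual definition is not given here), `prober_probability` is a plain logistic probe, and `steer` adds a fixed direction vector as a stand-in for the Refusal Head. Hidden states are plain Python lists rather than real model activations.

```python
import math

def mean_vector(vectors):
    """Element-wise mean of a list of equal-length vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]

def select_layer(safe_acts, toxic_acts):
    """Stand-in for SAS (assumption): pick the layer whose mean safe and
    mean toxic activations are farthest apart in Euclidean distance."""
    def gap(layer):
        return math.dist(mean_vector(safe_acts[layer]),
                         mean_vector(toxic_acts[layer]))
    return max(range(len(safe_acts)), key=gap)

def prober_probability(hidden, weights, bias):
    """Logistic probe (assumption): estimated probability of a toxic output."""
    logit = sum(h * w for h, w in zip(hidden, weights)) + bias
    return 1.0 / (1.0 + math.exp(-logit))

def steer(hidden, refusal_direction, strength):
    """Stand-in for the Refusal Head (assumption): shift the hidden state
    along a fixed refusal direction."""
    return [h + strength * r for h, r in zip(hidden, refusal_direction)]

def gated_intervention(hidden, weights, bias, refusal_direction,
                       threshold=0.5, max_strength=4.0):
    """Leave the hidden state untouched when the probe sees low risk;
    otherwise steer with a strength proportional to the estimated risk."""
    p = prober_probability(hidden, weights, bias)
    if p < threshold:
        return hidden, p, False
    return steer(hidden, refusal_direction, max_strength * p), p, True

# Toy usage: two layers, two-dimensional hidden states.
safe_acts = [[[0.0, 0.0], [0.0, 0.0]], [[0.0, 0.0], [0.0, 0.0]]]
toxic_acts = [[[0.1, 0.0], [0.1, 0.0]], [[1.0, 1.0], [1.0, 1.0]]]
chosen_layer = select_layer(safe_acts, toxic_acts)

probe_weights, probe_bias = [2.0, 0.0], -1.0
refusal_direction = [0.0, 1.0]
benign_out, benign_p, benign_steered = gated_intervention(
    [0.1, -0.2], probe_weights, probe_bias, refusal_direction)
risky_out, risky_p, risky_steered = gated_intervention(
    [2.0, 0.0], probe_weights, probe_bias, refusal_direction)
```

The gating is the point of the sketch: benign inputs pass through unchanged, which is one plausible way an inference-time method could reduce attack success without degrading general abilities. The threshold and strength values are arbitrary illustration choices.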
