Automating Steering for Safe Multimodal Large Language Models
July 17, 2025
Authors: Lyucheng Wu, Mengru Wang, Ziwen Xu, Tri Cao, Nay Oo, Bryan Hooi, Shumin Deng
cs.AI
Abstract
Recent progress in Multimodal Large Language Models (MLLMs) has unlocked
powerful cross-modal reasoning abilities but has also raised new safety
concerns, particularly when models face adversarial multimodal inputs. To
improve the safety of MLLMs during inference, we introduce AutoSteer, a
modular and adaptive inference-time intervention technique that requires no
fine-tuning of the underlying model. AutoSteer incorporates three core
components: (1) a novel Safety Awareness Score (SAS) that automatically
identifies the most safety-relevant distinctions among the model's internal
layers; (2) an adaptive safety prober trained to estimate the likelihood of
toxic outputs from intermediate representations; and (3) a lightweight Refusal
Head that selectively intervenes to modulate generation when safety risks are
detected. Experiments on LLaVA-OV and Chameleon across diverse safety-critical
benchmarks demonstrate that AutoSteer significantly reduces the Attack Success
Rate (ASR) for textual, visual, and cross-modal threats while preserving
general capabilities. These findings position AutoSteer as a practical,
interpretable, and effective framework for safer deployment of multimodal AI
systems.
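
The abstract names the three components but gives no implementation detail, so the following is a minimal sketch of how an inference-time safety prober and a refusal-steering intervention could fit together on a PyTorch model. Everything here is an illustrative assumption, not the paper's actual API: `SafetyProber`, `make_steering_hook`, `refusal_vector`, and `sas_layer_idx` are hypothetical names, and the SAS layer-selection score is assumed to be precomputed.

```python
import torch
import torch.nn as nn

class SafetyProber(nn.Module):
    """Hypothetical lightweight probe: a linear classifier over one
    layer's hidden states that estimates the risk of a toxic output."""
    def __init__(self, hidden_dim: int):
        super().__init__()
        self.classifier = nn.Linear(hidden_dim, 1)

    def forward(self, hidden_states: torch.Tensor) -> torch.Tensor:
        # hidden_states: (batch, seq_len, hidden_dim)
        pooled = hidden_states.mean(dim=1)             # (batch, hidden_dim)
        return torch.sigmoid(self.classifier(pooled))  # risk score in [0, 1]

def make_steering_hook(prober: SafetyProber,
                       refusal_vector: torch.Tensor,
                       threshold: float = 0.5):
    """Forward hook sketch: if the probed risk exceeds the threshold,
    add a precomputed refusal direction to the residual stream so that
    generation is steered toward a refusal. The refusal_vector stands in
    for whatever the Refusal Head contributes; its construction is not
    described in the abstract."""
    def hook(module, inputs, output):
        hidden = output[0] if isinstance(output, tuple) else output
        if prober(hidden).max().item() > threshold:
            hidden = hidden + refusal_vector  # broadcast over (batch, seq)
        return (hidden, *output[1:]) if isinstance(output, tuple) else hidden
    return hook

# Usage sketch (names assumed): attach the hook to the layer that the
# Safety Awareness Score identified as most safety-relevant, then
# generate as usual; remove the hook afterwards.
#
# layer = model.model.layers[sas_layer_idx]
# handle = layer.register_forward_hook(make_steering_hook(prober, refusal_vector))
# output_ids = model.generate(**inputs)
# handle.remove()
```

The key design point this sketch illustrates is that the intervention is purely inference-time: the base model's weights are untouched, and the probe-plus-steering logic only activates when the estimated risk crosses the threshold, which is how AutoSteer can suppress attacks while leaving benign generation unaffected.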