Automating Steering for Safe Multimodal Large Language Models
July 17, 2025
Authors: Lyucheng Wu, Mengru Wang, Ziwen Xu, Tri Cao, Nay Oo, Bryan Hooi, Shumin Deng
cs.AI
Abstract
Recent progress in Multimodal Large Language Models (MLLMs) has unlocked
powerful cross-modal reasoning abilities, but also raised new safety concerns,
particularly when faced with adversarial multimodal inputs. To improve the
safety of MLLMs during inference, we introduce AutoSteer, a modular and
adaptive inference-time intervention technique that requires no fine-tuning
of the underlying model. AutoSteer incorporates three core
components: (1) a novel Safety Awareness Score (SAS) that automatically
identifies the most safety-relevant distinctions among the model's internal
layers; (2) an adaptive safety prober trained to estimate the likelihood of
toxic outputs from intermediate representations; and (3) a lightweight Refusal
Head that selectively intervenes to modulate generation when safety risks are
detected. Experiments on LLaVA-OV and Chameleon across diverse safety-critical
benchmarks demonstrate that AutoSteer significantly reduces the Attack Success
Rate (ASR) for textual, visual, and cross-modal threats while preserving the
model's general capabilities. These findings position AutoSteer as a practical,
interpretable, and effective framework for safer deployment of multimodal AI
systems.
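
The abstract describes a three-stage inference-time pipeline: SAS-based layer selection, a safety prober over intermediate representations, and a Refusal Head that intervenes when risk is detected. Below is a minimal sketch of how such a pipeline could be wired together; it is not the authors' implementation. The names SafetyProber, safety_awareness_score, select_layer, and guarded_generate are hypothetical, the SAS computation shown is a toy separability measure rather than the paper's definition, and random tensors stand in for real MLLM hidden states.

```python
# Minimal, hypothetical sketch of an AutoSteer-style inference-time guard (not the authors' code).
import torch
import torch.nn as nn


class SafetyProber(nn.Module):
    """Small probe mapping an intermediate hidden state to a toxicity probability."""

    def __init__(self, hidden_dim: int):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(hidden_dim, 128),
            nn.ReLU(),
            nn.Linear(128, 1),
        )

    def forward(self, h: torch.Tensor) -> torch.Tensor:
        # h: (batch, hidden_dim) -> probability in [0, 1] per example
        return torch.sigmoid(self.net(h)).squeeze(-1)


def safety_awareness_score(safe_h: torch.Tensor, unsafe_h: torch.Tensor) -> float:
    """Toy stand-in for SAS: how separable safe vs. unsafe hidden states are at one
    layer, measured as the distance between class means over their pooled spread."""
    gap = (safe_h.mean(dim=0) - unsafe_h.mean(dim=0)).norm()
    spread = safe_h.std() + unsafe_h.std() + 1e-6
    return (gap / spread).item()


def select_layer(safe_per_layer, unsafe_per_layer) -> int:
    """Pick the layer whose hidden states best distinguish safe from unsafe inputs."""
    scores = [safety_awareness_score(s, u)
              for s, u in zip(safe_per_layer, unsafe_per_layer)]
    return int(torch.tensor(scores).argmax())


def guarded_generate(hidden_state, prober, generate_fn,
                     threshold=0.5, refusal="I can't help with that."):
    """Refusal-Head-style gate: refuse when the prober flags the hidden state as
    risky, otherwise fall through to normal generation."""
    if prober(hidden_state).item() > threshold:
        return refusal
    return generate_fn()


if __name__ == "__main__":
    torch.manual_seed(0)
    dim, n_layers = 64, 4
    # Fake calibration data: per-layer hidden states for safe vs. unsafe prompts.
    safe = [torch.randn(32, dim) for _ in range(n_layers)]
    unsafe = [torch.randn(32, dim) + 0.5 for _ in range(n_layers)]
    layer = select_layer(safe, unsafe)

    prober = SafetyProber(dim)  # would be trained on labeled hidden states in practice
    out = guarded_generate(torch.randn(1, dim), prober,
                           generate_fn=lambda: "normal model output")
    print(f"selected layer: {layer}, output: {out!r}")
```

In a real system the prober would be trained on hidden states extracted at the SAS-selected layer from labeled safe and unsafe prompts, and the refusal step would steer the generation process itself rather than returning a fixed refusal string.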