Automating Steering for Safe Multimodal Large Language Models
July 17, 2025
Authors: Lyucheng Wu, Mengru Wang, Ziwen Xu, Tri Cao, Nay Oo, Bryan Hooi, Shumin Deng
cs.AI
Abstract
Recent progress in Multimodal Large Language Models (MLLMs) has unlocked
powerful cross-modal reasoning abilities, but also raised new safety concerns,
particularly when faced with adversarial multimodal inputs. To improve the
safety of MLLMs during inference, we introduce AutoSteer, a modular and
adaptive inference-time intervention technique that requires no fine-tuning
of the underlying model. AutoSteer incorporates three core
components: (1) a novel Safety Awareness Score (SAS) that automatically
identifies the most safety-relevant distinctions among the model's internal
layers; (2) an adaptive safety prober trained to estimate the likelihood of
toxic outputs from intermediate representations; and (3) a lightweight Refusal
Head that selectively intervenes to modulate generation when safety risks are
detected. Experiments on LLaVA-OV and Chameleon across diverse safety-critical
benchmarks demonstrate that AutoSteer significantly reduces the Attack Success
Rate (ASR) for textual, visual, and cross-modal threats while preserving the
model's general capabilities. These findings position AutoSteer as a practical,
interpretable, and effective framework for safer deployment of multimodal AI
systems.
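
The abstract describes a three-stage inference-time pipeline: SAS-based layer selection, a safety prober over intermediate representations, and a Refusal Head that intervenes when risk is detected. Below is a minimal sketch of how such a pipeline could be wired together; it is not the authors' implementation. The names SafetyProber, safety_awareness_score, select_layer, and guarded_generate are hypothetical, the SAS computation shown is a toy separability measure rather than the paper's definition, and random tensors stand in for real MLLM hidden states.

```python
# Minimal, hypothetical sketch of an AutoSteer-style inference-time guard (not the authors' code).
import torch
import torch.nn as nn


class SafetyProber(nn.Module):
    """Small probe mapping an intermediate hidden state to a toxicity probability."""

    def __init__(self, hidden_dim: int):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(hidden_dim, 128),
            nn.ReLU(),
            nn.Linear(128, 1),
        )

    def forward(self, h: torch.Tensor) -> torch.Tensor:
        # h: (batch, hidden_dim) -> probability in [0, 1] per example
        return torch.sigmoid(self.net(h)).squeeze(-1)


def safety_awareness_score(safe_h: torch.Tensor, unsafe_h: torch.Tensor) -> float:
    """Toy stand-in for SAS: how separable safe vs. unsafe hidden states are at one
    layer, measured as the distance between class means over their pooled spread."""
    gap = (safe_h.mean(dim=0) - unsafe_h.mean(dim=0)).norm()
    spread = safe_h.std() + unsafe_h.std() + 1e-6
    return (gap / spread).item()


def select_layer(safe_per_layer, unsafe_per_layer) -> int:
    """Pick the layer whose hidden states best distinguish safe from unsafe inputs."""
    scores = [safety_awareness_score(s, u)
              for s, u in zip(safe_per_layer, unsafe_per_layer)]
    return int(torch.tensor(scores).argmax())


def guarded_generate(hidden_state, prober, generate_fn,
                     threshold=0.5, refusal="I can't help with that."):
    """Refusal-Head-style gate: refuse when the prober flags the hidden state as
    risky, otherwise fall through to normal generation."""
    if prober(hidden_state).item() > threshold:
        return refusal
    return generate_fn()


if __name__ == "__main__":
    torch.manual_seed(0)
    dim, n_layers = 64, 4
    # Fake calibration data: per-layer hidden states for safe vs. unsafe prompts.
    safe = [torch.randn(32, dim) for _ in range(n_layers)]
    unsafe = [torch.randn(32, dim) + 0.5 for _ in range(n_layers)]
    layer = select_layer(safe, unsafe)

    prober = SafetyProber(dim)  # would be trained on labeled hidden states in practice
    out = guarded_generate(torch.randn(1, dim), prober,
                           generate_fn=lambda: "normal model output")
    print(f"selected layer: {layer}, output: {out!r}")
```

In a real system the prober would be trained on hidden states extracted at the SAS-selected layer from labeled safe and unsafe prompts, and the refusal step would steer the generation process itself rather than returning a fixed refusal string.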