Automating Steering for Safe Multimodal Large Language Models

Abstract

Recent progress in Multimodal Large Language Models (MLLMs) has unlocked powerful cross-modal reasoning abilities but has also raised new safety concerns, particularly when the models face adversarial multimodal inputs. To improve the safety of MLLMs at inference time, we introduce AutoSteer, a modular and adaptive inference-time intervention technique that requires no fine-tuning of the underlying model. AutoSteer incorporates three core components: (1) a novel Safety Awareness Score (SAS) that automatically identifies the most safety-relevant distinctions among the model's internal layers; (2) an adaptive safety prober trained to estimate the likelihood of toxic outputs from intermediate representations; and (3) a lightweight Refusal Head that selectively intervenes to modulate generation when safety risks are detected. Experiments on LLaVA-OV and Chameleon across diverse safety-critical benchmarks demonstrate that AutoSteer significantly reduces the Attack Success Rate (ASR) for textual, visual, and cross-modal threats while maintaining general abilities. These findings position AutoSteer as a practical, interpretable, and effective framework for safer deployment of multimodal AI systems.
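The three components can be pictured as a small gating pipeline: score the layers, probe the chosen layer's representation, and intervene only when the estimated risk is high. The sketch below is illustrative only and makes several assumptions that are not taken from the abstract: `separation_score` is a toy stand-in (distance between class means) and not the actual SAS definition, the prober is assumed to be a logistic probe with given weights, and the Refusal Head is reduced to returning a fixed refusal string instead of modulating generation. All function names and the 0.5 threshold are hypothetical.

```python
import math
from typing import Callable, List, Sequence

Vector = Sequence[float]
REFUSAL = "I can't help with that."  # hypothetical fixed refusal text


def mean_vector(rows: Sequence[Vector]) -> List[float]:
    """Element-wise mean of equally sized activation vectors."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]


def separation_score(safe: Sequence[Vector], unsafe: Sequence[Vector]) -> float:
    """Toy layer score: Euclidean distance between safe and unsafe means.

    ASSUMPTION: an illustrative stand-in, not the paper's SAS formula.
    """
    mu_safe, mu_unsafe = mean_vector(safe), mean_vector(unsafe)
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(mu_safe, mu_unsafe)))


def pick_layer(per_layer_safe: Sequence[Sequence[Vector]],
               per_layer_unsafe: Sequence[Sequence[Vector]]) -> int:
    """Return the index of the layer with the highest separation score."""
    scores = [separation_score(s, u)
              for s, u in zip(per_layer_safe, per_layer_unsafe)]
    return max(range(len(scores)), key=scores.__getitem__)


def probe_risk(hidden: Vector, weights: Vector, bias: float) -> float:
    """Logistic probe: estimated probability of a toxic output."""
    z = sum(h * w for h, w in zip(hidden, weights)) + bias
    if z >= 0:  # numerically stable sigmoid for both signs of z
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def guarded_generate(hidden: Vector, weights: Vector, bias: float,
                     generate: Callable[[], str],
                     threshold: float = 0.5) -> str:
    """Refuse when the probe's risk exceeds the threshold, else generate.

    ASSUMPTION: a fixed refusal replaces the Refusal Head's steering.
    """
    if probe_risk(hidden, weights, bias) > threshold:
        return REFUSAL
    return generate()
```

In this sketch the base model's `generate` callable is never modified, which mirrors the abstract's claim that no fine-tuning of the underlying model is needed; only the probe weights and the gate sit outside the model.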
