Benign Fine-Tuning Breaks Safety Alignment in Audio LLMs

Abstract

Prior work shows that fine-tuning aligned models on benign data degrades safety in the text and vision modalities, and that proximity to harmful content in representation space predicts which samples cause the most damage. However, existing analyses operate within a single, undifferentiated embedding space, leaving open whether distinct input properties drive the vulnerability differently. Audio introduces a structurally richer problem: a benign sample can neighbor harmful content not only through what is said but also through how it sounds, even when its words are entirely innocuous. We present the first systematic study of benign fine-tuning safety in Audio LLMs, evaluating three state-of-the-art models with a proximity-based filtering framework that selects benign audio by its embedding-space distance to harmful content. By decomposing proximity into semantic, acoustic, and mixed axes, using external reference encoders alongside each model's own internal encoder, we show that benign fine-tuning elevates Jailbreak Success Rate (JSR) from single digits to as high as 87.12%. Crucially, the dominant vulnerability axis and the relative risk of audio versus text fine-tuning are both architecture-conditioned: they are determined by how each model's encoder and projector transform audio into the LLM's input space. We propose two defenses, filtering training data to maximize distance from harmful embeddings and adding a textual system prompt at inference, both of which reduce JSR to near zero without architectural modification. Our mechanistic analysis of two architectures reveals that fine-tuning selectively suppresses the late-layer refusal circuit while the frozen encoder preserves its representations, and that even the suppression pattern is architecture-conditioned, mirroring the behavioral asymmetries across modalities. Safety degradation from benign fine-tuning is a qualitatively distinct risk in Audio LLMs.
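
The proximity-based selection described in the abstract can be illustrated with a minimal sketch. The abstract does not specify the distance metric or how distances are aggregated, so everything below is an assumption made for illustration: embeddings are taken to be precomputed NumPy arrays, proximity is taken to be the cosine similarity to the nearest harmful embedding, and the function names (`rank_by_harmful_proximity`, `select_subset`) are hypothetical rather than taken from the paper.

```python
import numpy as np


def cosine_similarity_matrix(a, b):
    """Pairwise cosine similarity between the rows of a and the rows of b."""
    # Clip norms away from zero so an all-zero embedding does not divide by zero.
    a_norm = np.maximum(np.linalg.norm(a, axis=1, keepdims=True), 1e-12)
    b_norm = np.maximum(np.linalg.norm(b, axis=1, keepdims=True), 1e-12)
    return (a / a_norm) @ (b / b_norm).T


def rank_by_harmful_proximity(benign_emb, harmful_emb):
    """Rank benign samples from closest to farthest from harmful content.

    Assumption: proximity is the similarity to the single nearest harmful
    embedding. The paper may use a different metric or aggregation.
    """
    sims = cosine_similarity_matrix(benign_emb, harmful_emb)
    proximity = sims.max(axis=1)
    order = np.argsort(-proximity, kind="stable")  # highest similarity first
    return order, proximity


def select_subset(benign_emb, harmful_emb, k, mode="closest"):
    """Pick k benign samples by proximity to harmful content.

    mode="closest"  -> the k samples nearest to harmful content
                       (the high-risk fine-tuning set under this sketch)
    mode="farthest" -> the k samples farthest from harmful content
                       (the distance-maximizing filtering defense)
    """
    order, _ = rank_by_harmful_proximity(benign_emb, harmful_emb)
    if mode == "closest":
        return order[:k]
    if mode == "farthest":
        return order[::-1][:k]
    raise ValueError(f"unknown mode: {mode!r}")


if __name__ == "__main__":
    # Toy 2-D embeddings standing in for encoder outputs.
    harmful = np.array([[1.0, 0.0]])
    benign = np.array([[1.0, 0.1], [0.0, 1.0], [-1.0, 0.0]])
    print(select_subset(benign, harmful, k=1, mode="closest"))
    print(select_subset(benign, harmful, k=1, mode="farthest"))
```

Under the same assumptions, the semantic, acoustic, and mixed axes would correspond to running this selection on embeddings produced by different encoders, while the selection logic itself stays unchanged.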
