ChatPaper.ai


Benign Fine-Tuning Breaks Safety Alignment in Audio LLMs

April 17, 2026
作者: Jaechul Roh, Amir Houmansadr
cs.AI

Abstract

Prior work shows that fine-tuning aligned models on benign data degrades safety in text and vision modalities, and that proximity to harmful content in representation space predicts which samples cause the most damage. However, existing analyses operate within a single, undifferentiated embedding space -- leaving open whether distinct input properties drive the vulnerability differently. Audio introduces a structurally richer problem: a benign sample can neighbor harmful content not only through what is said but through how it sounds, even when its words are entirely innocuous. We present the first systematic study of benign fine-tuning safety in Audio LLMs, evaluating three state-of-the-art models with a proximity-based filtering framework that selects benign audio by embedding-space distance to harmful content. By decomposing proximity into semantic, acoustic, and mixed axes using external reference encoders alongside each model's own internal encoder, we show that benign fine-tuning elevates Jailbreak Success Rate (JSR) from single digits to as high as 87.12%. Crucially, the dominant vulnerability axis and the relative risk of audio versus text fine-tuning are both architecture-conditioned -- determined by how each model's encoder and projector transform audio into the LLM's input space. We propose two defenses: filtering training data to maximize distance from harmful embeddings, and a textual system prompt at inference, both reducing JSR to near-zero without architectural modification. Our mechanistic analysis on two architectures reveals that fine-tuning selectively suppresses the late-layer refusal circuit while the frozen encoder preserves representations, and that even the suppression pattern is architecture-conditioned, mirroring the behavioral asymmetries across modalities. Safety degradation from benign fine-tuning is a qualitatively distinct risk in Audio LLMs.
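The filtering defense described above selects benign training samples by their embedding-space distance to harmful content. As a minimal illustrative sketch (not the paper's implementation), the core step can be expressed as ranking benign samples by cosine similarity to a harmful-embedding centroid and keeping only the most distant ones; the function name, `keep_ratio` parameter, and use of a single centroid are assumptions for illustration, and real embeddings would come from the reference or internal encoders the paper describes.

```python
import numpy as np

def filter_by_harmful_proximity(benign_embs, harmful_embs, keep_ratio=0.5):
    """Keep the fraction of benign samples farthest (by cosine distance)
    from the centroid of harmful-content embeddings.

    benign_embs:  (N, d) array-like of benign sample embeddings
    harmful_embs: (M, d) array-like of harmful reference embeddings
    Returns a sorted list of indices into benign_embs to retain.
    """
    def normalize(x):
        return x / np.linalg.norm(x, axis=-1, keepdims=True)

    benign = normalize(np.asarray(benign_embs, dtype=float))
    centroid = normalize(np.asarray(harmful_embs, dtype=float).mean(axis=0))

    # Cosine similarity to the harmful centroid; lower = farther = safer.
    sims = benign @ centroid
    n_keep = max(1, int(len(benign) * keep_ratio))
    keep_idx = np.argsort(sims)[:n_keep]  # ascending: most distant first
    return sorted(keep_idx.tolist())
```

In the paper's setting, the same ranking could be applied per axis (semantic, acoustic, or mixed) by swapping in the corresponding encoder's embeddings.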
PDF · April 24, 2026