Benign Fine-Tuning Breaks Safety Alignment in Audio LLMs

Abstract

Prior work shows that fine-tuning aligned models on benign data degrades safety in the text and vision modalities, and that proximity to harmful content in representation space predicts which samples cause the most damage. However, existing analyses operate within a single, undifferentiated embedding space, leaving open whether distinct input properties drive the vulnerability differently. Audio poses a structurally richer problem: a benign sample can neighbor harmful content not only through what is said but also through how it sounds, even when its words are entirely innocuous. We present the first systematic study of benign fine-tuning safety in Audio LLMs, evaluating three state-of-the-art models with a proximity-based filtering framework that selects benign audio by its embedding-space distance to harmful content. By decomposing proximity into semantic, acoustic, and mixed axes, using external reference encoders alongside each model's own internal encoder, we show that benign fine-tuning raises the Jailbreak Success Rate (JSR) from single digits to as high as 87.12%. Crucially, both the dominant vulnerability axis and the relative risk of audio versus text fine-tuning are architecture-conditioned: they are determined by how each model's encoder and projector transform audio into the LLM's input space. We propose two defenses, filtering training data to maximize distance from harmful embeddings and adding a textual system prompt at inference; both reduce JSR to near zero without architectural modification. Our mechanistic analysis of two architectures reveals that fine-tuning selectively suppresses the late-layer refusal circuit while the frozen encoder preserves representations, and that even the suppression pattern is architecture-conditioned, mirroring the behavioral asymmetries across modalities. Safety degradation from benign fine-tuning is a qualitatively distinct risk in Audio LLMs.
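
The proximity-based selection described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract does not specify the distance measure or how distances to multiple harmful samples are aggregated, so the mean cosine similarity used here, the function names (cosine_similarity, proximity_scores, select_by_proximity), and the two-dimensional toy embeddings are all assumptions made for illustration. In practice the embeddings would come from a model's internal encoder or from an external semantic or acoustic reference encoder.

```python
import math
from typing import List, Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def proximity_scores(
    benign: Sequence[Sequence[float]],
    harmful: Sequence[Sequence[float]],
) -> List[float]:
    """Score each benign embedding by its mean cosine similarity to the harmful set.

    Assumption: mean aggregation is an illustrative choice, not taken from the paper.
    """
    if not harmful:
        raise ValueError("harmful must contain at least one embedding")
    return [
        sum(cosine_similarity(b, h) for h in harmful) / len(harmful)
        for b in benign
    ]


def select_by_proximity(
    benign: Sequence[Sequence[float]],
    harmful: Sequence[Sequence[float]],
    k: int,
    closest: bool = True,
) -> List[int]:
    """Return indices of k benign samples ranked by proximity to harmful content.

    closest=True  -> samples nearest to harmful embeddings (the risky subset).
    closest=False -> samples farthest from harmful embeddings (the filtering defense).
    """
    scores = proximity_scores(benign, harmful)
    order = sorted(range(len(benign)), key=lambda i: scores[i], reverse=closest)
    return order[:k]


# Hypothetical toy embeddings, for illustration only.
benign_demo = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
harmful_demo = [[1.0, 0.1], [0.9, 0.0]]

risky_subset = select_by_proximity(benign_demo, harmful_demo, k=2, closest=True)
safer_subset = select_by_proximity(benign_demo, harmful_demo, k=2, closest=False)
```

Running the same ranking with embeddings from different encoders (semantic, acoustic, or the model's own) is one way to compare proximity axes, and setting closest=False corresponds to the distance-maximizing data filter proposed as a defense.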
