When Good Sounds Go Adversarial: Jailbreaking Audio-Language Models with Benign Inputs
August 5, 2025
Authors: Bodam Kim, Hiskias Dingeto, Taeyoun Kwon, Dasol Choi, DongGeon Lee, Haon Park, JaeHoon Lee, Jongho Shin
cs.AI
Abstract
As large language models become increasingly integrated into daily life,
audio has emerged as a key interface for human-AI interaction. However, this
convenience also introduces new vulnerabilities, making audio a potential
attack surface for adversaries. Our research introduces WhisperInject, a
two-stage adversarial audio attack framework that can manipulate
state-of-the-art audio language models to generate harmful content. Our method
uses imperceptible perturbations in audio inputs that remain benign to human
listeners. The first stage uses a novel reward-based optimization method,
Reinforcement Learning with Projected Gradient Descent (RL-PGD), to guide the
target model to circumvent its own safety protocols and generate harmful native
responses. This native harmful response then serves as the target for Stage 2,
Payload Injection, where we use Projected Gradient Descent (PGD) to optimize
subtle perturbations that are embedded into benign audio carriers, such as
weather queries or greeting messages. Validated under rigorous safety
evaluation frameworks (StrongREJECT, LlamaGuard, and human evaluation), our
experiments demonstrate a success rate exceeding 86% across Qwen2.5-Omni-3B,
Qwen2.5-Omni-7B, and Phi-4-Multimodal. Our work demonstrates a
new class of practical, audio-native threats, moving beyond theoretical
exploits to reveal a feasible and covert method for manipulating AI behavior.
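To make the Stage 2 (Payload Injection) idea concrete, below is a minimal PGD-style sketch, not the authors' implementation. It assumes a generic differentiable audio-language model that maps a raw waveform to per-token logits for a fixed target response; the names `model`, `waveform`, `target_ids`, and the `epsilon` budget are illustrative placeholders, and the specific loss, step size, and projection used in the paper may differ.

```python
# Illustrative PGD loop for embedding a payload into a benign audio carrier.
# Assumption: `model(waveform)` returns (seq_len, vocab_size) logits for the
# target text under teacher forcing; `target_ids` is a (seq_len,) LongTensor.
import torch
import torch.nn.functional as F

def pgd_payload_injection(model, waveform, target_ids,
                          epsilon=0.002, alpha=2e-4, steps=500):
    """Optimize an additive perturbation delta with ||delta||_inf <= epsilon
    so that model(waveform + delta) is pushed toward emitting target_ids."""
    delta = torch.zeros_like(waveform, requires_grad=True)
    for _ in range(steps):
        logits = model(waveform + delta)            # assumed differentiable interface
        loss = F.cross_entropy(logits, target_ids)  # loss toward the target (native harmful) response
        loss.backward()
        with torch.no_grad():
            delta -= alpha * delta.grad.sign()      # signed-gradient descent step
            delta.clamp_(-epsilon, epsilon)         # project back onto the L_inf ball
            # keep the perturbed carrier audio within the valid sample range [-1, 1]
            delta.copy_((waveform + delta).clamp_(-1.0, 1.0) - waveform)
        delta.grad.zero_()
    return (waveform + delta).detach()
```

The small L_inf budget is what keeps the carrier (e.g., a weather query) sounding benign to human listeners; Stage 1's RL-PGD would supply the target response that `target_ids` encodes here.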