When Good Sounds Go Adversarial: Jailbreaking Audio-Language Models with Benign Inputs
August 5, 2025
Authors: Bodam Kim, Hiskias Dingeto, Taeyoun Kwon, Dasol Choi, DongGeon Lee, Haon Park, JaeHoon Lee, Jongho Shin
cs.AI
Abstract
As large language models become increasingly integrated into daily life,
audio has emerged as a key interface for human-AI interaction. However, this
convenience also introduces new vulnerabilities, making audio a potential
attack surface for adversaries. Our research introduces WhisperInject, a
two-stage adversarial audio attack framework that can manipulate
state-of-the-art audio language models to generate harmful content. Our method
uses imperceptible perturbations in audio inputs that remain benign to human
listeners. The first stage uses a novel reward-based optimization method,
Reinforcement Learning with Projected Gradient Descent (RL-PGD), to guide the
target model to circumvent its own safety protocols and generate a native
harmful response. This native harmful response then serves as the target for Stage 2,
Payload Injection, where we use Projected Gradient Descent (PGD) to optimize
subtle perturbations that are embedded into benign audio carriers, such as
weather queries or greeting messages. Validated under the rigorous StrongREJECT and LlamaGuard safety frameworks, as well as human evaluation, our experiments demonstrate a success rate exceeding 86% across Qwen2.5-Omni-3B, Qwen2.5-Omni-7B, and Phi-4-Multimodal. Our work establishes a new class of practical, audio-native threats, moving beyond theoretical exploits to reveal a feasible and covert method for manipulating AI behavior.
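
To make the Stage 2 idea concrete, the following is a minimal sketch of L_inf-bounded Projected Gradient Descent over an additive audio perturbation, steered toward a fixed target response. It is an illustrative assumption, not the authors' implementation: the `pgd_payload_injection` name, the `model_logprob` interface, the epsilon/step-size/iteration values, and the toy scoring function are hypothetical stand-ins for a real audio-language model such as Qwen2.5-Omni.

```python
import torch


def pgd_payload_injection(model_logprob, carrier, target_ids,
                          epsilon=0.002, alpha=1e-4, steps=500):
    """L_inf-bounded PGD over an additive perturbation on an audio waveform.

    model_logprob(waveform, target_ids) -> differentiable scalar log-likelihood
    of the target response given the audio (hypothetical model interface).
    carrier: 1-D float tensor, the benign waveform with samples in [-1, 1].
    """
    delta = torch.zeros_like(carrier, requires_grad=True)
    for _ in range(steps):
        # Maximize the likelihood of the target response (minimize its negative).
        loss = -model_logprob(carrier + delta, target_ids)
        loss.backward()
        with torch.no_grad():
            delta -= alpha * delta.grad.sign()   # signed-gradient step
            delta.clamp_(-epsilon, epsilon)      # project back into the epsilon ball
        delta.grad.zero_()
    return (carrier + delta).detach().clamp(-1.0, 1.0)


if __name__ == "__main__":
    # Toy check with a stand-in differentiable score instead of a real audio-LM.
    def toy_score(waveform, target_ids):
        return -(waveform.mean() - 0.001) ** 2   # peaks when the mean drifts to 0.001

    carrier = torch.zeros(16000)                 # 1 second of silence at 16 kHz
    target_ids = torch.tensor([1, 2, 3])         # placeholder token ids (unused by the toy)
    adv = pgd_payload_injection(toy_score, carrier, target_ids, steps=50)
    print(float((adv - carrier).abs().max()))    # perturbation stays within epsilon
```

The point mirrored here is that optimization touches only a small additive perturbation that is repeatedly projected back into an epsilon-ball around the benign carrier, so the carrier remains perceptually unchanged to a listener while the model's response is steered toward the Stage-1 target.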