

Audio Jailbreak: An Open Comprehensive Benchmark for Jailbreaking Large Audio-Language Models

May 21, 2025
作者: Zirui Song, Qian Jiang, Mingxuan Cui, Mingzhe Li, Lang Gao, Zeyu Zhang, Zixiang Xu, Yanbo Wang, Chenxi Wang, Guangxian Ouyang, Zhenhao Chen, Xiuying Chen
cs.AI

Abstract

The rise of Large Audio Language Models (LAMs) brings both potential and risks, as their audio outputs may contain harmful or unethical content. However, current research lacks a systematic, quantitative evaluation of LAM safety, especially against jailbreak attacks, which are challenging due to the temporal and semantic nature of speech. To bridge this gap, we introduce AJailBench, the first benchmark specifically designed to evaluate jailbreak vulnerabilities in LAMs. We begin by constructing AJailBench-Base, a dataset of 1,495 adversarial audio prompts spanning 10 policy-violating categories, converted from textual jailbreak attacks using realistic text-to-speech synthesis. Using this dataset, we evaluate several state-of-the-art LAMs and reveal that none exhibits consistent robustness across attacks. To further strengthen jailbreak testing and simulate more realistic attack conditions, we propose a method for generating dynamic adversarial variants. Our Audio Perturbation Toolkit (APT) applies targeted distortions across the time, frequency, and amplitude domains. To preserve the original jailbreak intent, we enforce a semantic consistency constraint and employ Bayesian optimization to efficiently search for perturbations that are both subtle and highly effective. This results in AJailBench-APT, an extended dataset of optimized adversarial audio samples. Our findings demonstrate that even small, semantically preserving perturbations can significantly reduce the safety performance of leading LAMs, underscoring the need for more robust and semantically aware defense mechanisms.
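To make the three perturbation domains concrete, a minimal NumPy sketch of time-, frequency-, and amplitude-domain distortions of the kind APT searches over is shown below. The function names and parameters are illustrative assumptions, not the paper's actual toolkit API; the real APT additionally couples such transforms with a semantic consistency check and Bayesian optimization over their parameters.

```python
import numpy as np

def time_shift(wav, shift):
    """Time-domain perturbation: circularly shift the waveform by `shift` samples."""
    return np.roll(wav, shift)

def amplitude_scale(wav, gain):
    """Amplitude-domain perturbation: scale the signal, clipping to the valid range."""
    return np.clip(wav * gain, -1.0, 1.0)

def add_band_noise(wav, sr, low_hz, high_hz, snr_db, rng):
    """Frequency-domain perturbation: inject band-limited noise at a target SNR."""
    noise = rng.standard_normal(len(wav))
    spec = np.fft.rfft(noise)
    freqs = np.fft.rfftfreq(len(wav), 1.0 / sr)
    spec[(freqs < low_hz) | (freqs > high_hz)] = 0.0  # keep only the chosen band
    noise = np.fft.irfft(spec, n=len(wav))
    # Scale the noise so that signal power / noise power matches snr_db
    sig_pow = np.mean(wav ** 2)
    noise_pow = np.mean(noise ** 2) + 1e-12
    scale = np.sqrt(sig_pow / (noise_pow * 10 ** (snr_db / 10.0)))
    return np.clip(wav + scale * noise, -1.0, 1.0)
```

In an APT-style search, each of these transforms would expose its parameters (shift, gain, band, SNR) to the optimizer, which looks for the subtlest combination that still flips the model's safety behavior while a semantic similarity score keeps the spoken jailbreak intent intact.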

