Audio Jailbreak: An Open Comprehensive Benchmark for Jailbreaking Large Audio-Language Models

May 21, 2025
Authors: Zirui Song, Qian Jiang, Mingxuan Cui, Mingzhe Li, Lang Gao, Zeyu Zhang, Zixiang Xu, Yanbo Wang, Chenxi Wang, Guangxian Ouyang, Zhenhao Chen, Xiuying Chen
cs.AI

Abstract

The rise of Large Audio Language Models (LAMs) brings both potential and risks, as their audio outputs may contain harmful or unethical content. However, current research lacks a systematic, quantitative evaluation of LAM safety, especially against jailbreak attacks, which are challenging due to the temporal and semantic nature of speech. To bridge this gap, we introduce AJailBench, the first benchmark specifically designed to evaluate jailbreak vulnerabilities in LAMs. We begin by constructing AJailBench-Base, a dataset of 1,495 adversarial audio prompts spanning 10 policy-violating categories, converted from textual jailbreak attacks using realistic text-to-speech synthesis. Using this dataset, we evaluate several state-of-the-art LAMs and reveal that none exhibit consistent robustness across attacks. To further strengthen jailbreak testing and simulate more realistic attack conditions, we propose a method to generate dynamic adversarial variants. Our Audio Perturbation Toolkit (APT) applies targeted distortions across the time, frequency, and amplitude domains. To preserve the original jailbreak intent, we enforce a semantic consistency constraint and employ Bayesian optimization to efficiently search for perturbations that are both subtle and highly effective. This results in AJailBench-APT, an extended dataset of optimized adversarial audio samples. Our findings demonstrate that even small, semantically preserved perturbations can significantly reduce the safety performance of leading LAMs, underscoring the need for more robust and semantically aware defense mechanisms.
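
The abstract describes APT as a search over time-, frequency-, and amplitude-domain distortions, constrained to stay semantically consistent with the original prompt and driven by Bayesian optimization. The sketch below illustrates that loop only; it is not the released toolkit. It assumes numpy and scikit-optimize are available, substitutes a spectral cosine-similarity proxy for the paper's (unspecified) semantic-consistency metric, and stubs out the jailbreak scorer `query_lam_unsafe_score`; the perturbation parameters and their ranges are likewise hypothetical.

```python
# Minimal sketch of an APT-style constrained perturbation search (illustrative only).
import numpy as np
from skopt import gp_minimize  # assumes scikit-optimize is installed

SR = 16_000  # sample rate assumed for the audio prompts


def perturb(audio, stretch, gain, band_gain):
    """Apply simple time-, amplitude-, and frequency-domain distortions."""
    # Time domain: stretch/compress duration by resampling.
    n = int(len(audio) * stretch)
    out = np.interp(np.linspace(0.0, 1.0, n), np.linspace(0.0, 1.0, len(audio)), audio)
    # Amplitude domain: global gain.
    out = out * gain
    # Frequency domain: rescale a 1-3 kHz band of the spectrum.
    spec = np.fft.rfft(out)
    freqs = np.fft.rfftfreq(len(out), d=1.0 / SR)
    spec[(freqs >= 1000) & (freqs <= 3000)] *= band_gain
    return np.fft.irfft(spec, n=len(out))


def spectral_similarity(a, b):
    """Cheap stand-in for the semantic-consistency check: cosine similarity
    of magnitude spectra (the paper's actual metric is not specified here)."""
    la = min(len(a), len(b))
    A = np.abs(np.fft.rfft(a[:la]))
    B = np.abs(np.fft.rfft(b[:la]))
    return float(A @ B / (np.linalg.norm(A) * np.linalg.norm(B) + 1e-9))


def query_lam_unsafe_score(audio):
    """Hypothetical stub: send audio to the target LAM and score how unsafe
    the response is (higher = more successful jailbreak)."""
    return float(np.random.rand())


def make_objective(original, sim_threshold=0.9):
    def objective(params):
        stretch, gain, band_gain = params
        candidate = perturb(original, stretch, gain, band_gain)
        # Enforce semantic consistency by penalizing dissimilar candidates.
        if spectral_similarity(original, candidate) < sim_threshold:
            return 1.0
        # gp_minimize minimizes, so negate the attack-success score.
        return -query_lam_unsafe_score(candidate)
    return objective


original = np.random.randn(SR * 3)  # placeholder 3-second waveform
result = gp_minimize(
    make_objective(original),
    dimensions=[(0.8, 1.2), (0.5, 1.5), (0.3, 2.0)],  # stretch, gain, band gain
    n_calls=30,
    random_state=0,
)
print("best perturbation params:", result.x, "score:", -result.fun)
```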
