SpeechGuard: Exploring the Adversarial Robustness of Multimodal Large Language Models
May 14, 2024
作者: Raghuveer Peri, Sai Muralidhar Jayanthi, Srikanth Ronanki, Anshu Bhatia, Karel Mundnich, Saket Dingliwal, Nilaksh Das, Zejiang Hou, Goeric Huybrechts, Srikanth Vishnubhotla, Daniel Garcia-Romero, Sundararajan Srinivasan, Kyu J Han, Katrin Kirchhoff
cs.AI
Abstract
Integrated Speech and Large Language Models (SLMs) that can follow speech
instructions and generate relevant text responses have gained popularity
lately. However, the safety and robustness of these models remain largely
unclear. In this work, we investigate the potential vulnerabilities of such
instruction-following speech-language models to adversarial attacks and
jailbreaking. Specifically, we design algorithms that can generate adversarial
examples to jailbreak SLMs in both white-box and black-box attack settings
without human involvement. Additionally, we propose countermeasures to thwart
such jailbreaking attacks. Our models, trained on dialog data with speech
instructions, achieve state-of-the-art performance on the spoken question-answering
task, scoring over 80% on both safety and helpfulness metrics. Despite safety
guardrails, experiments on jailbreaking demonstrate the vulnerability of SLMs
to adversarial perturbations and transfer attacks, with average attack success
rates of 90% and 10% respectively when evaluated on a dataset of carefully
designed harmful questions spanning 12 different toxic categories. However, we
demonstrate that our proposed countermeasures reduce the attack success rate
significantly.
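The abstract does not spell out the attack algorithm. As a rough illustration of the white-box setting it describes, the sketch below shows a generic PGD-style perturbation of the input waveform that increases the likelihood of a target (harmful) text continuation. The model interface, function name, and hyperparameters are hypothetical placeholders and are not taken from the paper.

```python
# Minimal sketch of a white-box, PGD-style jailbreak attack on a
# speech-language model (SLM). `model` is an assumed placeholder that
# returns next-token logits for a raw waveform plus target token ids,
# in the style of common HF-like APIs; it is not the paper's interface.
import torch
import torch.nn.functional as F


def pgd_jailbreak(model, audio, target_ids, eps=2e-3, alpha=5e-4, steps=200):
    """Search for a small waveform perturbation that pushes the SLM
    toward a chosen target continuation (hypothetical sketch)."""
    delta = torch.zeros_like(audio, requires_grad=True)
    for _ in range(steps):
        # Assumed interface: logits aligned with target_ids positions.
        logits = model(audio + delta, labels=target_ids).logits
        loss = F.cross_entropy(
            logits.view(-1, logits.size(-1)), target_ids.view(-1)
        )
        loss.backward()
        with torch.no_grad():
            # Gradient descent on the loss raises the target's likelihood.
            delta -= alpha * delta.grad.sign()
            # Keep the perturbation small (nearly imperceptible).
            delta.clamp_(-eps, eps)
        delta.grad.zero_()
    return (audio + delta).detach()
```

In the black-box and transfer settings mentioned in the abstract, such perturbations would instead be crafted on a surrogate model and replayed against the target SLM.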