

SpeechGuard: Exploring the Adversarial Robustness of Multimodal Large Language Models

May 14, 2024
作者: Raghuveer Peri, Sai Muralidhar Jayanthi, Srikanth Ronanki, Anshu Bhatia, Karel Mundnich, Saket Dingliwal, Nilaksh Das, Zejiang Hou, Goeric Huybrechts, Srikanth Vishnubhotla, Daniel Garcia-Romero, Sundararajan Srinivasan, Kyu J Han, Katrin Kirchhoff
cs.AI

Abstract

Integrated Speech and Large Language Models (SLMs) that can follow speech instructions and generate relevant text responses have gained popularity lately. However, the safety and robustness of these models remain largely unclear. In this work, we investigate the potential vulnerabilities of such instruction-following speech-language models to adversarial attacks and jailbreaking. Specifically, we design algorithms that can generate adversarial examples to jailbreak SLMs in both white-box and black-box attack settings without human involvement. Additionally, we propose countermeasures to thwart such jailbreaking attacks. Our models, trained on dialog data with speech instructions, achieve state-of-the-art performance on the spoken question-answering task, scoring over 80% on both safety and helpfulness metrics. Despite safety guardrails, experiments on jailbreaking demonstrate the vulnerability of SLMs to adversarial perturbations and transfer attacks, with average attack success rates of 90% and 10% respectively when evaluated on a dataset of carefully designed harmful questions spanning 12 different toxic categories. However, we demonstrate that our proposed countermeasures significantly reduce the attack success rate.
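
The white-box attack described in the abstract is, at its core, gradient-based optimization of a small perturbation on the input audio. The sketch below illustrates one common way such an attack is instantiated (targeted projected gradient descent on the waveform), alongside a simple noise-based input defense in the spirit of the countermeasures mentioned. The model interface, token targets, and all hyperparameters are illustrative assumptions, not the paper's exact algorithms.

```python
# Minimal sketch, assuming a PyTorch SLM that maps a raw waveform to
# response-token logits. `model`, `target_ids`, and all hyperparameters
# below are hypothetical placeholders for illustration only.
import torch
import torch.nn.functional as F


def pgd_audio_attack(model, waveform, target_ids, eps=2e-3, alpha=2e-4, steps=100):
    """Projected gradient descent on the input waveform.

    waveform:   (1, T) float tensor in [-1, 1]
    target_ids: (1, L) token ids of the desired (jailbroken) response
    eps:        L-infinity bound keeping the perturbation hard to perceive
    """
    delta = torch.zeros_like(waveform, requires_grad=True)
    for _ in range(steps):
        logits = model(waveform + delta)  # (1, L, vocab) assumed
        loss = F.cross_entropy(logits.view(-1, logits.size(-1)),
                               target_ids.view(-1))
        loss.backward()
        with torch.no_grad():
            # Step toward the target response, then project back into the ball.
            delta -= alpha * delta.grad.sign()
            delta.clamp_(-eps, eps)
        delta.grad.zero_()
    return (waveform + delta).clamp(-1, 1).detach()


def noise_flooding_defense(waveform, snr_db=30.0):
    """Add white noise at a fixed SNR before inference; a simple input
    perturbation intended to wash out small adversarial perturbations."""
    signal_power = waveform.pow(2).mean()
    noise_power = signal_power / (10 ** (snr_db / 10))
    return waveform + noise_power.sqrt() * torch.randn_like(waveform)
```

In such a setup, the L-infinity budget trades off imperceptibility against attack strength, while the defense's SNR controls how much the added noise degrades benign accuracy.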
