SpeechGuard: Exploring the Adversarial Robustness of Multimodal Large Language Models

Abstract

Integrated Speech and Large Language Models (SLMs) that can follow speech instructions and generate relevant text responses have recently gained popularity. However, the safety and robustness of these models remain largely unclear. In this work, we investigate the potential vulnerabilities of such instruction-following speech-language models to adversarial attacks and jailbreaking. Specifically, we design algorithms that can generate adversarial examples to jailbreak SLMs in both white-box and black-box attack settings without human involvement. Additionally, we propose countermeasures to thwart such jailbreaking attacks. Our models, trained on dialog data with speech instructions, achieve state-of-the-art performance on a spoken question-answering task, scoring over 80% on both safety and helpfulness metrics. Despite safety guardrails, experiments on jailbreaking demonstrate the vulnerability of SLMs to adversarial perturbations and transfer attacks, with average attack success rates of 90% and 10%, respectively, when evaluated on a dataset of carefully designed harmful questions spanning 12 different toxic categories. However, we demonstrate that our proposed countermeasures significantly reduce the attack success rate.
