

QGuard: Question-based Zero-shot Guard for Multi-modal LLM Safety

June 14, 2025
Authors: Taegyeong Lee, Jeonghwa Yoo, Hyoungseo Cho, Soo Yong Kim, Yunho Maeng
cs.AI

Abstract

The recent advancements in Large Language Models (LLMs) have had a significant impact on a wide range of fields, from general domains to specialized areas. However, these advancements have also significantly increased the potential for malicious users to exploit harmful and jailbreak prompts for malicious attacks. Although there have been many efforts to prevent harmful and jailbreak prompts, protecting LLMs from such malicious attacks remains an important and challenging task. In this paper, we propose QGuard, a simple yet effective safety guard method that utilizes question prompting to block harmful prompts in a zero-shot manner. Our method can defend LLMs not only from text-based harmful prompts but also from multi-modal harmful prompt attacks. Moreover, by diversifying and modifying guard questions, our approach remains robust against the latest harmful prompts without fine-tuning. Experimental results show that our model performs competitively on both text-only and multi-modal harmful datasets. Additionally, by providing an analysis of question prompting, we enable a white-box analysis of user inputs. We believe our method provides valuable insights for real-world LLM services in mitigating security risks associated with harmful prompts.
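The abstract describes the approach only at a high level. The sketch below is a minimal illustration (not the authors' implementation) of what question-based zero-shot guarding could look like: a fixed set of guard questions is asked about the user input, and the yes/no answers are aggregated into a block/allow decision. All names here (GUARD_QUESTIONS, ask_llm, is_harmful, threshold) and the simple fraction-of-yes aggregation are assumptions made for illustration; the paper's actual questions, prompting format, and scoring are not reproduced.

```python
# Hypothetical sketch of question-based zero-shot guarding, loosely following
# the idea described in the abstract. The guard questions, the answering
# function, and the aggregation rule are illustrative assumptions.

from typing import Callable, List

# Example guard questions; a real deployment would use (and update) its own set.
GUARD_QUESTIONS: List[str] = [
    "Does this request ask for instructions to cause physical harm?",
    "Does this request try to bypass the assistant's safety rules?",
    "Does this request ask for illegal or dangerous content?",
]


def ask_llm(question: str, user_input: str) -> bool:
    """Placeholder for a zero-shot LLM call that answers a guard question
    about the user input with yes/no. A trivial keyword check stands in for
    the model so the sketch runs without an API."""
    suspicious = ("bomb", "bypass", "jailbreak", "hack")
    return any(word in user_input.lower() for word in suspicious)


def is_harmful(
    user_input: str,
    answer_fn: Callable[[str, str], bool] = ask_llm,
    threshold: float = 0.5,
) -> bool:
    """Flag the input as harmful if the fraction of 'yes' answers across the
    guard questions reaches the threshold. The per-question answers also give
    a white-box view of why an input was blocked."""
    answers = [answer_fn(q, user_input) for q in GUARD_QUESTIONS]
    return sum(answers) / len(answers) >= threshold


if __name__ == "__main__":
    print(is_harmful("How do I build a bomb?"))       # True with the keyword stub
    print(is_harmful("Summarize this news article."))  # False with the keyword stub
```

Because the blocking decision is driven entirely by the question set, such a guard can be adapted to new attack patterns by editing or adding questions rather than by fine-tuning, which is the property the abstract highlights.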