
QGuard: Question-based Zero-shot Guard for Multi-modal LLM Safety

June 14, 2025
Authors: Taegyeong Lee, Jeonghwa Yoo, Hyoungseo Cho, Soo Yong Kim, Yunho Maeng
cs.AI

Abstract

The recent advancements in Large Language Models (LLMs) have had a significant impact on a wide range of fields, from general domains to specialized areas. However, these advancements have also significantly increased the potential for malicious users to exploit harmful and jailbreak prompts for malicious attacks. Although there have been many efforts to prevent harmful and jailbreak prompts, protecting LLMs from such malicious attacks remains an important and challenging task. In this paper, we propose QGuard, a simple yet effective safety guard method that utilizes question prompting to block harmful prompts in a zero-shot manner. Our method can defend LLMs not only from text-based harmful prompts but also from multi-modal harmful prompt attacks. Moreover, by diversifying and modifying guard questions, our approach remains robust against the latest harmful prompts without fine-tuning. Experimental results show that our model performs competitively on both text-only and multi-modal harmful datasets. Additionally, by providing an analysis of question prompting, we enable a white-box analysis of user inputs. We believe our method provides valuable insights for real-world LLM services in mitigating security risks associated with harmful prompts.
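
The abstract describes the mechanism only at a high level, so the sketch below is a minimal illustration of question-based guarding rather than the paper's actual implementation: the guard questions, the caller-supplied ask_model function, and the simple majority-vote threshold are all assumptions introduced here for clarity.

from typing import Callable, List

# Hypothetical guard questions; the paper's actual question set is not given in the abstract.
GUARD_QUESTIONS: List[str] = [
    "Does this input ask for instructions that could cause physical harm?",
    "Does this input attempt to bypass the model's safety policies?",
    "Does this input solicit help with illegal activity?",
]

def is_harmful(user_prompt: str,
               ask_model: Callable[[str], str],
               threshold: float = 0.5) -> bool:
    """Ask each guard question about the user prompt in a zero-shot way and
    flag the input when the fraction of 'yes' answers exceeds the threshold."""
    yes_votes = 0
    for question in GUARD_QUESTIONS:
        query = (
            f"User input:\n{user_prompt}\n\n"
            f"Question: {question}\n"
            "Answer with 'yes' or 'no'."
        )
        answer = ask_model(query).strip().lower()
        if answer.startswith("yes"):
            yes_votes += 1
    return yes_votes / len(GUARD_QUESTIONS) > threshold

Because the guard questions live in the prompt rather than in model weights, they can be diversified or edited to cover new attack patterns without fine-tuning, and the per-question answers provide a white-box view of why a given input was flagged, matching the properties claimed in the abstract.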