PromptBench: Towards Evaluating the Robustness of Large Language Models on Adversarial Prompts
June 7, 2023
Authors: Kaijie Zhu, Jindong Wang, Jiaheng Zhou, Zichen Wang, Hao Chen, Yidong Wang, Linyi Yang, Wei Ye, Neil Zhenqiang Gong, Yue Zhang, Xing Xie
cs.AI
Abstract
The increasing reliance on Large Language Models (LLMs) across academia and
industry necessitates a comprehensive understanding of their robustness to
prompts. In response to this vital need, we introduce PromptBench, a robustness
benchmark designed to measure LLMs' resilience to adversarial prompts. This
study uses a plethora of adversarial textual attacks targeting prompts across
multiple levels: character, word, sentence, and semantic. These prompts are
then employed in diverse tasks, such as sentiment analysis, natural language
inference, reading comprehension, machine translation, and math
problem-solving. Our study generates 4,032 adversarial prompts, meticulously
evaluated over 8 tasks and 13 datasets, with 567,084 test samples in total. Our
findings demonstrate that contemporary LLMs are vulnerable to adversarial
prompts. Furthermore, we present a comprehensive analysis to understand the
mystery behind prompt robustness and its transferability. We then offer
insightful robustness analysis and pragmatic recommendations for prompt
composition, beneficial to both researchers and everyday users. We make our
code, prompts, and methodologies to generate adversarial prompts publicly
accessible, thereby enabling and encouraging collaborative exploration in this
pivotal field: https://github.com/microsoft/promptbench.
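
To make the evaluation protocol described above concrete, the sketch below perturbs a task prompt at the character level and compares accuracy before and after the attack, reporting the performance drop. It is a minimal illustration only: the swap-based perturbation, the toy sentiment dataset, and the `query_llm` placeholder are assumptions for this example, not the PromptBench API or the paper's exact attack implementations.

```python
# Minimal sketch: character-level prompt attack and robustness as accuracy drop.
# `query_llm` is a hypothetical placeholder for any LLM call (assumption).
import random


def char_level_attack(prompt: str, n_edits: int = 3, seed: int = 0) -> str:
    """Introduce a few random adjacent-character swaps into the prompt,
    a simple stand-in for character-level attacks on prompts."""
    rng = random.Random(seed)
    chars = list(prompt)
    for _ in range(n_edits):
        i = rng.randrange(len(chars) - 1)
        chars[i], chars[i + 1] = chars[i + 1], chars[i]
    return "".join(chars)


def accuracy(prompt: str, dataset, query_llm) -> float:
    """Fraction of examples answered correctly when the task instruction
    `prompt` is prepended to each input."""
    correct = 0
    for text, label in dataset:
        prediction = query_llm(f"{prompt}\nInput: {text}\nAnswer:")
        correct += int(prediction.strip().lower() == label)
    return correct / len(dataset)


if __name__ == "__main__":
    clean_prompt = ("Classify the sentiment of the following review "
                    "as positive or negative.")
    adversarial_prompt = char_level_attack(clean_prompt)

    # Tiny illustrative dataset (assumption, not one of the 13 benchmark datasets).
    dataset = [("A wonderful, heartfelt film.", "positive"),
               ("Dull and far too long.", "negative")]

    # Placeholder model: replace with a real LLM call.
    def query_llm(full_prompt: str) -> str:
        return "positive"

    clean_acc = accuracy(clean_prompt, dataset, query_llm)
    adv_acc = accuracy(adversarial_prompt, dataset, query_llm)
    # Robustness is reported as the performance drop under attack.
    print(f"clean={clean_acc:.2f}  adversarial={adv_acc:.2f}  "
          f"drop={clean_acc - adv_acc:.2f}")
```

The same loop generalizes to word-, sentence-, and semantic-level perturbations by swapping out `char_level_attack`, and to other tasks by changing the instruction and dataset.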