PromptBench: Towards Evaluating the Robustness of Large Language Models on Adversarial Prompts

Abstract

The increasing reliance on Large Language Models (LLMs) across academia and industry necessitates a comprehensive understanding of their robustness to prompts. In response to this need, we introduce PromptBench, a robustness benchmark designed to measure LLMs' resilience to adversarial prompts. This study uses a wide range of adversarial textual attacks targeting prompts at four levels: character, word, sentence, and semantic. These prompts are then employed in diverse tasks, such as sentiment analysis, natural language inference, reading comprehension, machine translation, and math problem-solving. Our study generates 4,032 adversarial prompts, which are evaluated over 8 tasks and 13 datasets, with 567,084 test samples in total. Our findings demonstrate that contemporary LLMs are vulnerable to adversarial prompts. Furthermore, we present a comprehensive analysis to understand the factors behind prompt robustness and its transferability. We then offer robustness analysis and pragmatic recommendations for prompt composition, beneficial to both researchers and everyday users. We make our code, prompts, and methodologies for generating adversarial prompts publicly accessible to encourage collaborative exploration in this field: https://github.com/microsoft/promptbench.
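
To make the notion of a character-level attack on a prompt concrete, the following is a minimal sketch of a typo-style perturbation. It is an illustration under stated assumptions, not the PromptBench implementation: the function name `char_level_perturb`, its parameters, and the adjacent-letter swap strategy are all hypothetical choices made for this example.

```python
import random


def char_level_perturb(prompt: str, n_edits: int = 2, seed: int = 0) -> str:
    """Toy character-level attack: swap two adjacent interior letters
    in up to `n_edits` randomly chosen words of the prompt.

    Hypothetical helper for illustration only; it is not taken from
    the PromptBench code base.
    """
    rng = random.Random(seed)  # local RNG so results are reproducible
    words = prompt.split(" ")
    # Only alphabetic words with at least 4 characters have an interior pair.
    candidates = [i for i, w in enumerate(words) if len(w) >= 4 and w.isalpha()]
    for i in rng.sample(candidates, min(n_edits, len(candidates))):
        w = words[i]
        j = rng.randrange(1, len(w) - 2)  # keep first and last characters fixed
        words[i] = w[:j] + w[j + 1] + w[j] + w[j + 2:]
    return " ".join(words)


if __name__ == "__main__":
    clean = "Classify the sentiment of this sentence"
    print(char_level_perturb(clean, n_edits=2, seed=0))
```

The perturbed prompt keeps the same length and the same characters as the clean one, so it stays readable to a human while differing at the character level. A robustness evaluation in this spirit would compare task accuracy under the clean prompt and under the perturbed prompt.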
