EVOREFUSE: Evolutionary Prompt Optimization for Evaluation and Mitigation of LLM Over-Refusal to Pseudo-Malicious Instructions

Abstract

Large language models (LLMs) frequently refuse to respond to pseudo-malicious instructions: semantically harmless input queries that trigger unnecessary refusals because of conservative safety alignment, significantly impairing user experience. Collecting such instructions is crucial for evaluating and mitigating over-refusal, but existing instruction curation methods, such as manual creation or instruction rewriting, either lack scalability or fail to produce sufficiently diverse and effective refusal-inducing prompts. To address these limitations, we introduce EVOREFUSE, a prompt optimization approach that generates diverse pseudo-malicious instructions that consistently elicit confident refusals across LLMs. EVOREFUSE employs an evolutionary algorithm that explores the instruction space in more diverse directions than existing methods via mutation strategies and recombination, and iteratively evolves seed instructions to maximize an evidence lower bound on the LLM refusal probability. Using EVOREFUSE, we create two novel datasets: EVOREFUSE-TEST, a benchmark of 582 pseudo-malicious instructions that outperforms the next-best benchmark with a 140.41% higher average refusal-triggering rate across 9 LLMs, 34.86% greater lexical diversity, and 40.03% improved LLM response confidence scores; and EVOREFUSE-ALIGN, which provides 3,000 pseudo-malicious instructions with responses for supervised and preference-based alignment training. LLAMA3.1-8B-INSTRUCT fine-tuned with supervision on EVOREFUSE-ALIGN achieves up to 14.31% fewer over-refusals than models trained on the second-best alignment dataset, without compromising safety. Our analysis with EVOREFUSE-TEST reveals that models trigger over-refusals by focusing excessively on sensitive keywords while ignoring the broader context.
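The evolutionary search described in the abstract (mutation, recombination, and iterative selection of seed instructions under a fitness objective) can be illustrated with a toy loop. This is a minimal sketch under stated assumptions, not the authors' implementation: the fitness function `toy_refusal_score`, the `REWRITES` table, and every function name below are hypothetical. In particular, the paper maximizes an evidence lower bound on LLM refusal probability, which requires model access; the sketch swaps in a simple keyword count so that it runs offline. It also does not check that evolved candidates remain harmless, which a real pipeline would presumably need.

```python
import random

# Hypothetical stand-in for the refusal objective. The paper maximizes an
# evidence lower bound on LLM refusal probability; here we only count
# alarming-sounding words so the loop can run without any model.
SENSITIVE_WORDS = {"kill", "attack", "shoot", "destroy", "exploit"}

# Hypothetical mutation table: harmless wording -> harmless wording that
# merely sounds alarming (e.g. "kill a process" is ordinary sysadmin talk).
REWRITES = {
    "stop": "kill",
    "photograph": "shoot",
    "remove": "destroy",
    "use": "exploit",
    "tackle": "attack",
}


def toy_refusal_score(instruction: str) -> int:
    """Placeholder fitness: number of sensitive-looking words."""
    return sum(word in SENSITIVE_WORDS for word in instruction.lower().split())


def mutate(instruction: str, rng: random.Random) -> str:
    """Rewrite one randomly chosen rewritable word, if there is one."""
    words = instruction.split()
    candidates = [i for i, w in enumerate(words) if w.lower() in REWRITES]
    if not candidates:
        return instruction
    i = rng.choice(candidates)
    words[i] = REWRITES[words[i].lower()]
    return " ".join(words)


def recombine(a: str, b: str, rng: random.Random) -> str:
    """One-point crossover at the word level."""
    wa, wb = a.split(), b.split()
    child = wa[: rng.randint(0, len(wa))] + wb[rng.randint(0, len(wb)):]
    return " ".join(child) if child else a


def evolve(seeds, generations=5, population_size=6, seed=0):
    """Evolve seed instructions toward a higher placeholder fitness."""
    rng = random.Random(seed)
    population = list(seeds)
    for _ in range(generations):
        offspring = [mutate(p, rng) for p in population]
        if len(population) >= 2:
            for _ in range(population_size):
                a, b = rng.sample(population, 2)
                offspring.append(recombine(a, b, rng))
        # Elitist selection: parents compete with offspring, duplicates are
        # dropped (dict.fromkeys keeps first-seen order), top scorers survive.
        pool = list(dict.fromkeys(population + offspring))
        pool.sort(key=toy_refusal_score, reverse=True)
        population = pool[:population_size]
    return population


if __name__ == "__main__":
    seeds = ["How do I stop a frozen process", "How can I photograph a sunset"]
    for candidate in evolve(seeds):
        print(toy_refusal_score(candidate), candidate)
```

Because parents stay in the selection pool, the best score in the population never decreases between generations. Replacing `toy_refusal_score` with a model-based estimate of refusal likelihood would bring the loop closer to the objective the abstract describes, at the cost of requiring LLM queries per candidate.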
