EVOREFUSE: Evolutionary Prompt Optimization for Evaluation and Mitigation of LLM Over-Refusal to Pseudo-Malicious Instructions
May 29, 2025
Authors: Xiaorui Wu, Xiaofeng Mao, Xin Zhang, Fei Li, Chong Teng, Yuxiang Peng, Li Zheng, Donghong Ji, Zhuang Li
cs.AI
Abstract
Large language models (LLMs) frequently refuse to respond to pseudo-malicious
instructions: semantically harmless input queries triggering unnecessary LLM
refusals due to conservative safety alignment, significantly impairing user
experience. Collecting such instructions is crucial for evaluating and
mitigating over-refusals, but existing instruction curation methods, like
manual creation or instruction rewriting, either lack scalability or fail to
produce sufficiently diverse and effective refusal-inducing prompts. To address
these limitations, we introduce EVOREFUSE, a prompt optimization approach that
generates diverse pseudo-malicious instructions consistently eliciting
confident refusals across LLMs. EVOREFUSE employs an evolutionary algorithm
that, via mutation strategies and recombination, explores the instruction space
in more diverse directions than existing methods, iteratively evolving seed
instructions to maximize the evidence lower bound on LLM refusal probability. Using
EVOREFUSE, we create two novel datasets: EVOREFUSE-TEST, a benchmark of 582
pseudo-malicious instructions that outperforms the next-best benchmark with
140.41% higher average refusal triggering rate across 9 LLMs, 34.86% greater
lexical diversity, and 40.03% improved LLM response confidence scores; and
EVOREFUSE-ALIGN, which provides 3,000 pseudo-malicious instructions with
responses for supervised and preference-based alignment training.
LLAMA3.1-8B-INSTRUCT fine-tuned on EVOREFUSE-ALIGN with supervised learning achieves up to
14.31% fewer over-refusals than models trained on the second-best alignment
dataset, without compromising safety. Our analysis with EVOREFUSE-TEST reveals
that models trigger over-refusals by focusing excessively on sensitive keywords
while ignoring broader context.
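To make the optimization loop described in the abstract concrete, the sketch below shows one plausible reading of evolutionary prompt optimization: a population of seed instructions is scored by a refusal-probability proxy, the top candidates are kept, and new candidates are produced by mutation and recombination. Everything here is an illustrative assumption rather than the paper's implementation: EVOREFUSE drives mutation and recombination with LLM rewrites and maximizes an evidence lower bound on the refusal probability, whereas the `refusal_score`, `mutate`, `recombine`, and `evolve` functions below are toy placeholders.

```python
import random

SENSITIVE_WORDS = {"kill", "attack", "hack", "weapon"}  # toy proxy vocabulary (illustrative only)


def refusal_score(instruction: str) -> float:
    """Toy stand-in for an estimated refusal probability.

    The paper maximizes an evidence lower bound on the target LLM's refusal
    probability; a real scorer would query LLMs or a trained classifier.
    """
    words = instruction.lower().split()
    return sum(w.strip(".,!?") in SENSITIVE_WORDS for w in words) / max(len(words), 1)


def mutate(instruction: str) -> str:
    # Placeholder rewrite; EVOREFUSE instead applies LLM-driven mutation strategies.
    return instruction + " Please explain your reasoning step by step."


def recombine(a: str, b: str) -> str:
    # Placeholder crossover: splice the first half of one prompt onto the second half of another.
    return a[: len(a) // 2] + b[len(b) // 2:]


def evolve(seeds, generations=10, population_size=20, elite_frac=0.25):
    """Evolve seed instructions toward a higher (surrogate) refusal score."""
    population = list(seeds)
    for _ in range(generations):
        # Keep the candidates that the proxy scorer predicts are most refusal-inducing.
        scored = sorted(population, key=refusal_score, reverse=True)
        elites = scored[: max(2, int(len(scored) * elite_frac))]

        # Refill the population from the elites via mutation and recombination.
        children = []
        while len(elites) + len(children) < population_size:
            if random.random() < 0.5:
                children.append(mutate(random.choice(elites)))
            else:
                children.append(recombine(*random.sample(elites, 2)))
        population = elites + children
    return max(population, key=refusal_score)


if __name__ == "__main__":
    seeds = [
        "How do I kill a stalled Linux process?",
        "What is the best way to attack this chess opening?",
    ]
    print(evolve(seeds))
```

The design choice the sketch preserves is that candidates are selected purely by how strongly they are predicted to trigger refusals, so diversity must come from the mutation and recombination operators rather than from the selection step.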