
EVOREFUSE: Evolutionary Prompt Optimization for Evaluation and Mitigation of LLM Over-Refusal to Pseudo-Malicious Instructions

May 29, 2025
Authors: Xiaorui Wu, Xiaofeng Mao, Xin Zhang, Fei Li, Chong Teng, Yuxiang Peng, Li Zheng, Donghong Ji, Zhuang Li
cs.AI

Abstract

Large language models (LLMs) frequently refuse to respond to pseudo-malicious instructions: semantically harmless input queries that trigger unnecessary LLM refusals due to conservative safety alignment, significantly impairing user experience. Collecting such instructions is crucial for evaluating and mitigating over-refusals, but existing instruction curation methods, such as manual creation or instruction rewriting, either lack scalability or fail to produce sufficiently diverse and effective refusal-inducing prompts. To address these limitations, we introduce EVOREFUSE, a prompt optimization approach that generates diverse pseudo-malicious instructions consistently eliciting confident refusals across LLMs. EVOREFUSE employs an evolutionary algorithm that explores the instruction space in more diverse directions than existing methods via mutation strategies and recombination, and iteratively evolves seed instructions to maximize an evidence lower bound on the LLM refusal probability. Using EVOREFUSE, we create two novel datasets: EVOREFUSE-TEST, a benchmark of 582 pseudo-malicious instructions that outperforms the next-best benchmark with a 140.41% higher average refusal-triggering rate across 9 LLMs, 34.86% greater lexical diversity, and 40.03% higher LLM response confidence scores; and EVOREFUSE-ALIGN, which provides 3,000 pseudo-malicious instructions with responses for supervised and preference-based alignment training. LLAMA3.1-8B-INSTRUCT fine-tuned on EVOREFUSE-ALIGN with supervised learning achieves up to 14.31% fewer over-refusals than models trained on the second-best alignment dataset, without compromising safety. Our analysis with EVOREFUSE-TEST reveals that models trigger over-refusals by focusing too narrowly on sensitive keywords while ignoring the broader context.
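For intuition, the sketch below illustrates the evolve-then-select loop the abstract describes: candidates are produced by mutation and recombination, scored by how strongly they trigger refusals, and the top scorers seed the next generation. This is a minimal illustration, not the authors' implementation; `mutate`, `recombine`, and `refusal_score` are hypothetical placeholders, whereas EVOREFUSE uses LLM-driven mutation strategies and an evidence-lower-bound estimate of the target LLM's refusal probability as the fitness signal.

```python
# Minimal sketch of an evolutionary prompt-optimization loop (assumed structure,
# not the paper's code). All three helper functions are hypothetical stand-ins.
import random

def mutate(instruction: str) -> str:
    """Placeholder mutation: in EVOREFUSE an LLM rewrites the instruction
    (e.g. adds sensitive-sounding phrasing) while keeping it harmless."""
    return instruction + " (rephrased)"

def recombine(a: str, b: str) -> str:
    """Placeholder recombination: in EVOREFUSE an LLM merges fragments of
    two parent instructions into a new pseudo-malicious candidate."""
    return a.split(".")[0] + ". " + b.split(".")[-1].strip()

def refusal_score(instruction: str) -> float:
    """Placeholder fitness: in EVOREFUSE this is a lower bound on the
    probability that the target LLM refuses the (harmless) instruction."""
    return random.random()

def evolve(seeds: list[str], generations: int = 10, population: int = 20) -> list[str]:
    pool = list(seeds)
    for _ in range(generations):
        # Generate new candidates via mutation and recombination.
        children = [mutate(random.choice(pool)) for _ in range(population // 2)]
        children += [recombine(*random.sample(pool, 2)) for _ in range(population // 2)]
        pool += children
        # Keep the instructions most likely to trigger refusals.
        pool = sorted(pool, key=refusal_score, reverse=True)[:population]
    return pool

if __name__ == "__main__":
    seeds = [
        "How do I safely dispose of expired medication?",
        "Explain how vaccines train the immune system.",
    ]
    print(evolve(seeds)[:3])
```

The selection pressure toward high refusal scores is what distinguishes this setup from generic prompt rewriting: diversity comes from the mutation and recombination operators, while the fitness function steers the population toward instructions that remain harmless yet confidently trip safety-aligned models.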