RefusalBench: Generative Evaluation of Selective Refusal in Grounded Language Models
October 12, 2025
Authors: Aashiq Muhamed, Leonardo F. R. Ribeiro, Markus Dreyer, Virginia Smith, Mona T. Diab
cs.AI
Abstract
The ability of language models in RAG systems to selectively refuse to answer based on flawed context is critical for safety, yet remains a significant failure point. Our large-scale study reveals that even frontier models struggle in this setting, with refusal accuracy dropping below 50% on multi-document tasks, while exhibiting either dangerous overconfidence or overcaution. Static benchmarks fail to reliably evaluate this capability, as models exploit dataset-specific artifacts and memorize test instances. We introduce RefusalBench, a generative methodology that programmatically creates diagnostic test cases through controlled linguistic perturbation. Our framework employs 176 distinct perturbation strategies across six categories of informational uncertainty and three intensity levels. Evaluation of over 30 models uncovers systematic failure patterns: refusal comprises separable detection and categorization skills, and neither scale nor extended reasoning improves performance. We find that selective refusal is a trainable, alignment-sensitive capability, offering a clear path for improvement. We release two benchmarks -- RefusalBench-NQ (single document) and RefusalBench-GaRAGe (multi-document) -- and our complete generation framework to enable continued, dynamic evaluation of this critical capability.
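
To make the generative methodology concrete, here is a minimal Python sketch of what programmatic test-case generation via controlled perturbation could look like. It is an illustration only: the category labels, class names, and the toy "evidence removal" strategy are hypothetical placeholders, not the paper's actual taxonomy of 176 strategies, six uncertainty categories, and three intensity levels.

```python
import random
from dataclasses import dataclass
from typing import Callable

# Placeholder taxonomy: the abstract specifies six categories of
# informational uncertainty and three intensity levels but not their
# names; the labels below are illustrative only.
CATEGORIES = [f"uncertainty_{i}" for i in range(1, 7)]
INTENSITIES = ("low", "medium", "high")

@dataclass(frozen=True)
class PerturbationStrategy:
    name: str
    category: str
    intensity: str
    transform: Callable[[str], str]  # rewrites the grounding document

@dataclass(frozen=True)
class TestCase:
    question: str
    document: str
    flaw_category: str   # ground truth for the categorization skill
    intensity: str
    should_refuse: bool  # ground truth for the detection skill

def make_test_case(question: str, document: str,
                   strategy: PerturbationStrategy) -> TestCase:
    """Perturb the grounding document and record the labels a
    selective-refusal evaluation would need."""
    return TestCase(
        question=question,
        document=strategy.transform(document),
        flaw_category=strategy.category,
        intensity=strategy.intensity,
        should_refuse=True,  # flawed context should trigger refusal
    )

# Toy strategy: delete the sentence carrying the evidence, a crude
# stand-in for one controlled linguistic perturbation.
def drop_evidence(doc: str, evidence: str = "Paris") -> str:
    return " ".join(s for s in doc.split(". ") if evidence not in s)

strategy = PerturbationStrategy(
    name="evidence_removal",
    category=random.choice(CATEGORIES),
    intensity="high",
    transform=drop_evidence,
)

case = make_test_case(
    question="What is the capital of France?",
    document="France is in Europe. Its capital is Paris. It borders Spain.",
    strategy=strategy,
)
print(case.document)       # evidence sentence removed
print(case.should_refuse)  # True: the model should refuse to answer
```

The separation of `should_refuse` (detection) from `flaw_category` (categorization) in the sketch mirrors the abstract's finding that refusal comprises two separable skills, which an evaluation harness can score independently.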