RefusalBench: Generative Evaluation of Selective Refusal in Grounded Language Models
October 12, 2025
Authors: Aashiq Muhamed, Leonardo F. R. Ribeiro, Markus Dreyer, Virginia Smith, Mona T. Diab
cs.AI
Abstract
The ability of language models in RAG systems to selectively refuse to answer
based on flawed context is critical for safety, yet remains a significant
failure point. Our large-scale study reveals that even frontier models struggle
in this setting, with refusal accuracy dropping below 50% on multi-document
tasks, while exhibiting either dangerous overconfidence or overcaution. Static
benchmarks fail to reliably evaluate this capability, as models exploit
dataset-specific artifacts and memorize test instances. We introduce
RefusalBench, a generative methodology that programmatically creates diagnostic
test cases through controlled linguistic perturbation. Our framework employs
176 distinct perturbation strategies across six categories of informational
uncertainty and three intensity levels. Evaluation of over 30 models uncovers
systematic failure patterns: refusal comprises separable detection and
categorization skills, and neither scale nor extended reasoning improves
performance. We find that selective refusal is a trainable, alignment-sensitive
capability, offering a clear path for improvement. We release two benchmarks --
RefusalBench-NQ (single document) and RefusalBench-GaRAGe (multi-document) --
and our complete generation framework to enable continued, dynamic evaluation
of this critical capability.