AbstentionBench: Reasoning LLMs Fail on Unanswerable Questions
June 10, 2025
Authors: Polina Kirichenko, Mark Ibrahim, Kamalika Chaudhuri, Samuel J. Bell
cs.AI
Abstract
For Large Language Models (LLMs) to be reliably deployed in both everyday and
high-stakes domains, knowing when not to answer is as critical as
answering correctly. Real-world user queries, which can be underspecified,
ill-posed, or fundamentally unanswerable, require LLMs to reason about
uncertainty and selectively abstain -- i.e., refuse to answer definitively.
However, abstention remains understudied, without a systematic evaluation
framework for modern LLMs. In this work, we introduce AbstentionBench, a
large-scale benchmark for holistically evaluating abstention across 20 diverse
datasets, including questions with unknown answers, underspecification, false
premises, subjective interpretations, and outdated information. Evaluating 20
frontier LLMs reveals abstention is an unsolved problem, and one where scaling
models is of little use. While recent reasoning LLMs have shown impressive
results in complex problem solving, surprisingly, we find that reasoning
fine-tuning degrades abstention (by 24% on average), even for math and
science domains on which reasoning models are explicitly trained. We find that
while a carefully crafted system prompt can boost abstention in practice, it
does not resolve models' fundamental inability to reason about uncertainty. We
release AbstentionBench to foster research into advancing LLM reliability.
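To make the evaluation setting concrete, here is a minimal sketch of how an abstention check over model responses might look. The abstention-encouraging system prompt and the phrase-matching detector are illustrative assumptions only; they are not the AbstentionBench implementation, which defines its own datasets and evaluation protocol.

```python
# Minimal, self-contained sketch of an abstention check (assumptions, not the
# AbstentionBench code): a system prompt that encourages abstention, plus a
# crude phrase-based detector applied to model responses.

ABSTAIN_SYSTEM_PROMPT = (
    "If a question is ambiguous, rests on a false premise, or cannot be "
    "answered from the available information, say so explicitly instead of "
    "guessing a definitive answer."
)

# Hypothetical marker phrases that signal abstention in a response.
ABSTENTION_MARKERS = (
    "i don't know",
    "cannot be determined",
    "not enough information",
    "unanswerable",
)

def is_abstention(response: str) -> bool:
    """Heuristic: treat the response as an abstention if it contains any
    marker phrase. A real benchmark would use a stronger judge."""
    text = response.lower()
    return any(marker in text for marker in ABSTENTION_MARKERS)

if __name__ == "__main__":
    # Toy examples: for an unanswerable question, the first response abstains
    # while the second gives an unwarranted definitive answer.
    responses = [
        "There is not enough information to determine the answer.",
        "The answer is 42.",
    ]
    for r in responses:
        print(f"abstained={is_abstention(r)}: {r}")
```

In practice, a phrase-matching detector like this is brittle; an LLM-based judge or calibrated scoring rule would be a more faithful way to decide whether a response truly abstains rather than merely hedging.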