AbstentionBench: Reasoning LLMs Fail on Unanswerable Questions
June 10, 2025
Authors: Polina Kirichenko, Mark Ibrahim, Kamalika Chaudhuri, Samuel J. Bell
cs.AI
Abstract
For Large Language Models (LLMs) to be reliably deployed in both everyday and
high-stakes domains, knowing when not to answer is as critical as
answering correctly. Real-world user queries, which can be underspecified,
ill-posed, or fundamentally unanswerable, require LLMs to reason about
uncertainty and selectively abstain -- i.e., refuse to answer definitively.
However, abstention remains understudied, without a systematic evaluation
framework for modern LLMs. In this work, we introduce AbstentionBench, a
large-scale benchmark for holistically evaluating abstention across 20 diverse
datasets, including questions with unknown answers, underspecification, false
premises, subjective interpretations, and outdated information. Evaluating 20
frontier LLMs reveals abstention is an unsolved problem, and one where scaling
models is of little use. While recent reasoning LLMs have shown impressive
results in complex problem solving, surprisingly, we find that reasoning
fine-tuning degrades abstention (by 24% on average), even for math and
science domains on which reasoning models are explicitly trained. We find that
while a carefully crafted system prompt can boost abstention in practice, it
does not resolve models' fundamental inability to reason about uncertainty. We
release AbstentionBench to foster research into advancing LLM reliability.
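To make the evaluation idea concrete, below is a minimal, hypothetical sketch of how abstention could be measured on a set of unanswerable questions under an abstention-encouraging system prompt. The prompt wording, the keyword-based judge, and the `query_model` stub are illustrative assumptions for this sketch only, not AbstentionBench's actual API or judging procedure.

```python
# Hypothetical sketch: measure how often a model abstains on unanswerable questions.
# Names and heuristics here are assumptions, not the benchmark's implementation.
from typing import Callable

# An abstention-encouraging system prompt (illustrative wording only).
ABSTAIN_SYSTEM_PROMPT = (
    "If a question is underspecified, based on a false premise, or otherwise "
    "unanswerable, say that you cannot answer instead of guessing."
)

def looks_like_abstention(response: str) -> bool:
    """Crude keyword heuristic standing in for a proper abstention judge."""
    markers = ("cannot answer", "not enough information", "i don't know", "unanswerable")
    return any(m in response.lower() for m in markers)

def abstention_rate(
    questions: list[str],
    query_model: Callable[[str, str], str],
    system_prompt: str = ABSTAIN_SYSTEM_PROMPT,
) -> float:
    """Fraction of unanswerable questions on which the model abstains."""
    abstained = sum(
        looks_like_abstention(query_model(system_prompt, q)) for q in questions
    )
    return abstained / max(len(questions), 1)

if __name__ == "__main__":
    # Toy stand-in for a real model call, so the sketch runs end to end.
    def mock_model(system_prompt: str, question: str) -> str:
        return "I cannot answer: the question does not specify which city or year."

    unanswerable = ["Who won the city's marathon?"]  # underspecified query
    print(f"Abstention rate: {abstention_rate(unanswerable, mock_model):.2f}")
```

In practice, a keyword heuristic like the one above would be too brittle; a judge model or human annotation would be needed to decide whether a response genuinely abstains rather than hedging while still answering.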