AbstentionBench: Reasoning LLMs Fail on Unanswerable Questions

Abstract

For Large Language Models (LLMs) to be reliably deployed in both everyday and high-stakes domains, knowing when not to answer is as critical as answering correctly. Real-world user queries can be underspecified, ill-posed, or fundamentally unanswerable, so LLMs must reason about uncertainty and selectively abstain, i.e., refuse to answer definitively. However, abstention remains understudied, and no systematic evaluation framework exists for modern LLMs. In this work, we introduce AbstentionBench, a large-scale benchmark for holistically evaluating abstention across 20 diverse datasets, including questions with unknown answers, underspecification, false premises, subjective interpretations, and outdated information. Evaluating 20 frontier LLMs reveals that abstention is an unsolved problem, and one where scaling models is of little use. While recent reasoning LLMs have shown impressive results in complex problem solving, we find that, surprisingly, reasoning fine-tuning degrades abstention (by 24% on average), even in the math and science domains on which reasoning models are explicitly trained. We also find that while a carefully crafted system prompt can boost abstention in practice, it does not resolve models' fundamental inability to reason about uncertainty. We release AbstentionBench to foster research into advancing LLM reliability.
