Empty Shelves or Lost Keys? Recall Is the Bottleneck for Parametric Factuality

February 15, 2026
Authors: Nitay Calderon, Eyal Ben-David, Zorik Gekhman, Eran Ofek, Gal Yona
cs.AI

Abstract

Standard factuality evaluations of LLMs treat all errors alike, obscuring whether failures arise from missing knowledge (empty shelves) or from limited access to encoded facts (lost keys). We propose a behavioral framework that profiles factual knowledge at the level of facts rather than questions, characterizing each fact by whether it is encoded, and then by how accessible it is: cannot be recalled, can be directly recalled, or can only be recalled with inference-time computation (thinking). To support such profiling, we introduce WikiProfile, a new benchmark constructed via an automated pipeline with a prompted LLM grounded in web search. Across 4 million responses from 13 LLMs, we find that encoding is nearly saturated in frontier models on our benchmark, with GPT-5 and Gemini-3 encoding 95--98% of facts. However, recall remains a major bottleneck: many errors previously attributed to missing knowledge instead stem from failures to access it. These failures are systematic and disproportionately affect long-tail facts and reverse questions. Finally, we show that thinking improves recall and can recover a substantial fraction of failures, indicating that future gains may rely less on scaling and more on methods that improve how models utilize what they already encode.
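The fact-level profiling described above can be sketched as a small decision rule over three behavioral signals per fact. This is a minimal illustrative sketch, not the paper's actual pipeline: the function name, signal names, and profile labels are assumptions chosen to mirror the abstract's taxonomy (encoded vs. not, and if encoded, whether the fact is directly recalled, recalled only with thinking, or not recalled at all).

```python
from enum import Enum


class FactProfile(Enum):
    """Hypothetical labels mirroring the abstract's fact taxonomy."""
    NOT_ENCODED = "empty shelf: knowledge missing"
    ENCODED_NOT_RECALLED = "lost key: encoded but inaccessible"
    RECALLED_WITH_THINKING = "recalled only with inference-time computation"
    DIRECTLY_RECALLED = "directly recalled"


def profile_fact(encoded: bool, direct_recall: bool,
                 thinking_recall: bool) -> FactProfile:
    """Assign a fact-level profile from three behavioral signals.

    encoded: model demonstrates the fact is stored (e.g., succeeds in
        some probing setting) -- an assumed stand-in for the paper's
        encoding test.
    direct_recall: correct on a direct closed-book question.
    thinking_recall: correct only when inference-time computation
        (thinking) is allowed.
    """
    if not encoded:
        return FactProfile.NOT_ENCODED       # empty shelf
    if direct_recall:
        return FactProfile.DIRECTLY_RECALLED
    if thinking_recall:
        return FactProfile.RECALLED_WITH_THINKING
    return FactProfile.ENCODED_NOT_RECALLED  # lost key
```

Under this sketch, the paper's headline finding corresponds to most facts landing in the three encoded branches for frontier models, with the "lost key" branch, rather than the "empty shelf" branch, accounting for most errors.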