Empty Shelves or Lost Keys? Recall Is the Bottleneck for Parametric Factuality
February 15, 2026
Authors: Nitay Calderon, Eyal Ben-David, Zorik Gekhman, Eran Ofek, Gal Yona
cs.AI
Abstract
Standard factuality evaluations of LLMs treat all errors alike, obscuring whether failures arise from missing knowledge (empty shelves) or from limited access to encoded facts (lost keys). We propose a behavioral framework that profiles factual knowledge at the level of facts rather than questions, characterizing each fact by whether it is encoded, and then by how accessible it is: cannot be recalled, can be directly recalled, or can only be recalled with inference-time computation (thinking). To support such profiling, we introduce WikiProfile, a new benchmark constructed via an automated pipeline with a prompted LLM grounded in web search. Across 4 million responses from 13 LLMs, we find that encoding is nearly saturated in frontier models on our benchmark, with GPT-5 and Gemini-3 encoding 95–98% of facts. However, recall remains a major bottleneck: many errors previously attributed to missing knowledge instead stem from failures to access it. These failures are systematic and disproportionately affect long-tail facts and reverse questions. Finally, we show that thinking improves recall and can recover a substantial fraction of failures, indicating that future gains may rely less on scaling and more on methods that improve how models utilize what they already encode.
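The taxonomy in the abstract reduces to a small decision procedure per fact. The sketch below is a minimal illustration, not the paper's implementation: the names, and the assumption that encoding and recall are each observed as a boolean signal (e.g., from a separate encoding probe, a direct-recall query, and a thinking-enabled query), are ours.

```python
from enum import Enum, auto

class FactProfile(Enum):
    """Behavioral profile of a single fact, following the abstract's taxonomy."""
    NOT_ENCODED = auto()           # "empty shelf": the fact is absent from the parameters
    DIRECT_RECALL = auto()         # recalled correctly without thinking
    THINKING_RECALL = auto()       # recalled only with inference-time computation
    ENCODED_NOT_RECALLED = auto()  # "lost key": encoded but not accessible

def profile_fact(encoded: bool, direct_correct: bool, thinking_correct: bool) -> FactProfile:
    """Map three behavioral signals onto one profile.

    encoded:          some probe indicates the fact is stored in the model
                      (how encoding is operationalized is an assumption here).
    direct_correct:   the model answers correctly when asked directly.
    thinking_correct: the model answers correctly with thinking enabled.
    """
    if not encoded:
        return FactProfile.NOT_ENCODED
    if direct_correct:
        return FactProfile.DIRECT_RECALL
    if thinking_correct:
        return FactProfile.THINKING_RECALL
    return FactProfile.ENCODED_NOT_RECALLED

# Example: a fact the model encodes but surfaces only with thinking,
# i.e. a "lost key" that inference-time computation recovers.
assert profile_fact(True, False, True) is FactProfile.THINKING_RECALL
```

Profiling at the fact level, rather than the question level, is what lets the framework separate the last two cases: a wrong answer alone cannot distinguish an empty shelf from a lost key.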