Empty Shelves or Lost Keys? Recall Is the Bottleneck for Parametric Factuality
February 15, 2026
Authors: Nitay Calderon, Eyal Ben-David, Zorik Gekhman, Eran Ofek, Gal Yona
cs.AI
Abstract
Standard factuality evaluations of LLMs treat all errors alike, obscuring whether failures arise from missing knowledge (empty shelves) or from limited access to encoded facts (lost keys). We propose a behavioral framework that profiles factual knowledge at the level of facts rather than questions, characterizing each fact by whether it is encoded, and then by how accessible it is: cannot be recalled, can be directly recalled, or can only be recalled with inference-time computation (thinking). To support such profiling, we introduce WikiProfile, a new benchmark constructed via an automated pipeline with a prompted LLM grounded in web search. Across 4 million responses from 13 LLMs, we find that encoding is nearly saturated in frontier models on our benchmark, with GPT-5 and Gemini-3 encoding 95–98% of facts. However, recall remains a major bottleneck: many errors previously attributed to missing knowledge instead stem from failures to access it. These failures are systematic and disproportionately affect long-tail facts and reverse questions. Finally, we show that thinking improves recall and can recover a substantial fraction of failures, indicating that future gains may rely less on scaling and more on methods that improve how models utilize what they already encode.
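The taxonomy in the abstract reduces to a small decision procedure per fact. The sketch below is a minimal illustration, not the paper's implementation: the names, and the assumption that encoding and recall are each observed as a boolean signal (e.g., from a separate encoding probe, a direct-recall query, and a thinking-enabled query), are ours.

```python
from enum import Enum, auto

class FactProfile(Enum):
    """Behavioral profile of a single fact, following the abstract's taxonomy."""
    NOT_ENCODED = auto()           # "empty shelf": the fact is absent from the parameters
    DIRECT_RECALL = auto()         # recalled correctly without thinking
    THINKING_RECALL = auto()       # recalled only with inference-time computation
    ENCODED_NOT_RECALLED = auto()  # "lost key": encoded but not accessible

def profile_fact(encoded: bool, direct_correct: bool, thinking_correct: bool) -> FactProfile:
    """Map three behavioral signals onto one profile.

    encoded:          some probe indicates the fact is stored in the model
                      (how encoding is operationalized is an assumption here).
    direct_correct:   the model answers correctly when asked directly.
    thinking_correct: the model answers correctly with thinking enabled.
    """
    if not encoded:
        return FactProfile.NOT_ENCODED
    if direct_correct:
        return FactProfile.DIRECT_RECALL
    if thinking_correct:
        return FactProfile.THINKING_RECALL
    return FactProfile.ENCODED_NOT_RECALLED

# Example: a fact the model encodes but surfaces only with thinking,
# i.e. a "lost key" that inference-time computation recovers.
assert profile_fact(True, False, True) is FactProfile.THINKING_RECALL
```

Profiling at the fact level, rather than the question level, is what lets the framework separate the last two cases: a wrong answer alone cannot distinguish an empty shelf from a lost key.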