The Curious Case of Factual (Mis)Alignment between LLMs' Short- and Long-Form Answers
October 13, 2025
Authors: Saad Obaid ul Islam, Anne Lauscher, Goran Glavaš
cs.AI
Abstract
Large language models (LLMs) can correctly answer "When was Einstein born?"
yet fail to provide the same date when writing about Einstein's life, revealing
a fundamental inconsistency in how models access factual knowledge across task
complexities. While models display impressive accuracy on factual
question-answering benchmarks, the reliability gap between simple and complex
queries remains poorly understood, eroding their trustworthiness. In this work,
we introduce Short-Long Form Alignment for Factual Question Answering (SLAQ), a
controlled evaluation framework that compares LLMs' answers to the same factual
questions asked (a) in isolation (short) vs. (b) integrated into complex
queries (long). Evaluating 16 LLMs on 600 queries, we find systematic
misalignment between answers to corresponding short and long queries. We further
uncover position-dependent accuracy loss and momentum effects, where consecutive
correct or incorrect answers create self-reinforcing patterns. Through
mechanistic analysis, we find that aligned facts activate overlapping model
internals, and that metrics based on mechanistic similarity can predict
short-long answer alignment with up to 78% accuracy. Our work establishes
factual consistency over query complexity as an important aspect of LLMs'
trustworthiness and challenges current evaluation practices, which implicitly
assume that good performance on simple factual queries implies reliability in
more complex knowledge-seeking tasks as well.
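
To make the short-vs-long comparison concrete, below is a minimal sketch of a SLAQ-style alignment check. The prompt templates, the string-match `is_aligned` heuristic, and the `toy_model` stub are all illustrative assumptions for this sketch, not the paper's actual pipeline or answer-matching procedure.

```python
from typing import Callable

# An LLM is modeled as any callable mapping a prompt to a completion.
Model = Callable[[str], str]


def short_form_query(question: str) -> str:
    # (a) the factual question asked in isolation
    return question


def long_form_query(question: str, topic: str) -> str:
    # (b) the same fact requested inside a complex, multi-part prompt
    return (
        f"Write a short biography of {topic}, covering upbringing, "
        f"major achievements, and legacy. Be sure to address: {question}"
    )


def is_aligned(short_answer: str, long_answer: str, gold: str) -> bool:
    # Crude surface check: both responses must contain the gold fact.
    # A real evaluation would use normalized answer extraction or a judge model.
    g = gold.lower()
    return g in short_answer.lower() and g in long_answer.lower()


def slaq_alignment_rate(model: Model, items: list[dict]) -> float:
    """Fraction of facts answered consistently in short and long form."""
    aligned = sum(
        is_aligned(
            model(short_form_query(item["question"])),
            model(long_form_query(item["question"], item["topic"])),
            item["gold"],
        )
        for item in items
    )
    return aligned / len(items)


if __name__ == "__main__":
    # Toy stub standing in for a real LLM API call.
    def toy_model(prompt: str) -> str:
        return "Albert Einstein was born on 14 March 1879 in Ulm."

    items = [
        {
            "question": "When was Einstein born?",
            "topic": "Albert Einstein",
            "gold": "14 March 1879",
        }
    ]
    print(f"alignment rate: {slaq_alignment_rate(toy_model, items):.2f}")
```

Note that this sketch compares only surface strings; the paper's mechanistic-similarity predictor (reaching up to 78% accuracy) instead operates on overlapping model internals, which a black-box check like this cannot capture.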