Can LLMs Predict Their Own Failures? Self-Awareness via Internal Circuits
December 23, 2025
Authors: Amirhosein Ghasemabadi, Di Niu
cs.AI
Abstract
Large language models (LLMs) generate fluent and complex outputs but often fail to recognize their own mistakes and hallucinations. Existing approaches typically rely on external judges, multi-sample consistency, or text-based self-critique, which either incur additional compute or correlate weakly with true correctness. We ask: can LLMs predict their own failures by inspecting internal states during inference? We introduce Gnosis, a lightweight self-awareness mechanism that enables frozen LLMs to perform intrinsic self-verification by decoding signals from hidden states and attention patterns. Gnosis passively observes internal traces, compresses them into fixed-budget descriptors, and predicts correctness with negligible inference cost, adding only ~5M parameters and operating independently of sequence length. Across math reasoning, open-domain question answering, and academic knowledge benchmarks, and over frozen backbones ranging from 1.7B to 20B parameters, Gnosis consistently outperforms strong internal baselines and large external judges in both accuracy and calibration. Moreover, it generalizes zero-shot to partial generations, enabling early detection of failing trajectories and compute-aware control. These results show that reliable correctness cues are intrinsic to the generation process and can be extracted efficiently without external supervision.
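The abstract gives no implementation details, but the mechanism it outlines (pool internal states into a fixed number of descriptors, then score correctness) can be sketched. Below is a minimal, hypothetical PyTorch probe under those assumptions; `CorrectnessProbe`, its layer sizes, and the query-based pooling are illustrative choices rather than the authors' architecture, and it reads only hidden states, omitting the attention-pattern features Gnosis also uses.

```python
import torch
import torch.nn as nn


class CorrectnessProbe(nn.Module):
    """Hypothetical Gnosis-style probe (sketch only, not the authors' code):
    pools per-token hidden states from a frozen LLM into a fixed budget of
    descriptors via learned queries, then predicts a correctness probability.
    """

    def __init__(self, hidden_dim: int = 2048, descriptor_dim: int = 256,
                 num_descriptors: int = 8):
        super().__init__()
        # Project frozen hidden states down to a small working dimension.
        self.proj = nn.Linear(hidden_dim, descriptor_dim)
        # A fixed budget of learned queries; attending over tokens yields
        # num_descriptors vectors regardless of sequence length.
        self.queries = nn.Parameter(torch.randn(num_descriptors, descriptor_dim) * 0.02)
        self.pool = nn.MultiheadAttention(descriptor_dim, num_heads=4, batch_first=True)
        # Small head mapping the pooled descriptors to a single correctness logit.
        self.head = nn.Sequential(
            nn.Linear(num_descriptors * descriptor_dim, descriptor_dim),
            nn.GELU(),
            nn.Linear(descriptor_dim, 1),
        )

    def forward(self, hidden_states: torch.Tensor) -> torch.Tensor:
        # hidden_states: (batch, seq_len, hidden_dim), taken from a frozen backbone.
        tokens = self.proj(hidden_states)                     # (B, T, D)
        queries = self.queries.unsqueeze(0).expand(tokens.size(0), -1, -1)
        descriptors, _ = self.pool(queries, tokens, tokens)   # (B, K, D)
        logit = self.head(descriptors.flatten(1))             # (B, 1)
        return torch.sigmoid(logit).squeeze(-1)               # estimated P(correct)


if __name__ == "__main__":
    probe = CorrectnessProbe()
    dummy_states = torch.randn(2, 317, 2048)  # any sequence length works
    print(probe(dummy_states))
```

Because the learned queries reduce any trace to a fixed number of descriptors before classification, the probe's cost after pooling is constant in sequence length, which is consistent with the abstract's claim and with applying the same probe to partial generations.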