大型語言模型能否內省？一項現實檢驗

摘要

大型語言模型能否偵測並回報自身的內部狀態？諸多研究主張此問題的答案為肯定。我們根據人類後設認知研究的教訓認為，這項結論可能言之過早：若欲確信此結論，必須區分真正的內省與基於表面線索的模式匹配。此外，我們主張僅憑行為證據本身無法充分證實強烈的內省論述。基於此觀點，我們重新審視近期提出的兩種評估典範。在第一種典範中，模型需判斷其內部狀態是否遭受竄改。我們發現模型無法可靠區分這類對內部狀態的干預與對輸入的操弄，顯示其在原始研究中的成功實際反映的是模型偵測異常事件的普遍能力，而非專門針對內部狀態的干預。在第二種檢驗的典範中，模型需預測源自自身隱藏狀態的標籤。此處我們發現，僅能存取輸入資料的分類器，其表現與模型自身的語境內預測不相上下，顯示原始結果未能確鑿證明模型對其內部表徵具有特權存取。我們進一步引入重新標記的控制情境，使模型無法仰賴任務語義解決問題，而必須依賴內部表徵；在此經過更佳控制的任務版本中，模型表現趨近於隨機水準。整體而言，這些結果顯示現有證據尚不足以證明大型語言模型具備後設認知監控能力。

English

Can large language models detect and report their own internal states? A number of studies have argued that the answer to this question is yes. We argue, based on lessons from human metacognition research, that this conclusion may be premature: to be convinced of this conclusion we need to distinguish genuine introspection from pattern matching based on surface-level cues. Furthermore, we argue that behavioral evidence alone is inherently insufficient to establish strong introspective claims. We re-examine two recently introduced evaluation paradigms in light of this consideration. In the first paradigm, models are expected to detect whether their internal states have been tampered with. We find that models cannot reliably distinguish such interventions on their internal states from manipulations of the input, suggesting that their success in the original studies reflects their ability to detect anomalies more generally, as opposed to interventions on their internal states in particular. In the second paradigm we examine, models are tasked with predicting labels derived from their own hidden states. Here, we find that classifiers that only have access to the input achieve equivalent performance to the model's own in-context predictions, indicating that the original results do not conclusively demonstrate that the model has privileged access to its internal representations. We further introduce a relabeled control setting, where models cannot rely on the semantics of the task to solve it, and instead must rely on the internal representation; models perform closer to chance on this better-controlled version of the task. Taken together, these results indicate that current evidence is insufficient to establish that LLMs display metacognitive monitoring.