Can Large Language Models Introspect? A Reality Check

Abstract

Can large language models detect and report their own internal states? A number of studies have argued that they can. Drawing on lessons from human metacognition research, we argue that this conclusion may be premature: accepting it requires distinguishing genuine introspection from pattern matching on surface-level cues. We further argue that behavioral evidence alone is inherently insufficient to establish strong introspective claims. In light of these considerations, we re-examine two recently introduced evaluation paradigms. In the first, models are expected to detect whether their internal states have been tampered with. We find that models cannot reliably distinguish such interventions on their internal states from manipulations of the input, suggesting that their success in the original studies reflects a general ability to detect anomalies rather than sensitivity to interventions on internal states in particular. In the second paradigm, models are tasked with predicting labels derived from their own hidden states. Here, we find that classifiers with access only to the input achieve performance equivalent to the model's own in-context predictions, indicating that the original results do not conclusively demonstrate that the model has privileged access to its internal representations. We further introduce a relabeled control setting in which models cannot rely on the semantics of the task and must instead rely on the internal representation; on this better-controlled version of the task, models perform closer to chance. Taken together, these results indicate that current evidence is insufficient to establish that LLMs display metacognitive monitoring.