Can Large Language Models Introspect? A Reality Check

Abstract

Can large language models detect and report their own internal states? A number of studies have argued that the answer to this question is yes. We argue, based on lessons from human metacognition research, that this conclusion may be premature: to accept it, we need to distinguish genuine introspection from pattern matching based on surface-level cues. Furthermore, we argue that behavioral evidence alone is inherently insufficient to establish strong introspective claims. We re-examine two recently introduced evaluation paradigms in light of this consideration. In the first paradigm, models are expected to detect whether their internal states have been tampered with. We find that models cannot reliably distinguish such interventions on their internal states from manipulations of the input, suggesting that their success in the original studies reflects an ability to detect anomalies more generally, rather than interventions on their internal states in particular. In the second paradigm, models are tasked with predicting labels derived from their own hidden states. Here, we find that classifiers with access only to the input achieve performance equivalent to the model's own in-context predictions, indicating that the original results do not conclusively demonstrate that the model has privileged access to its internal representations. We further introduce a relabeled control setting, in which models cannot rely on the semantics of the task to solve it and must instead rely on the internal representation; models perform closer to chance on this better-controlled version of the task. Taken together, these results indicate that current evidence is insufficient to establish that LLMs display metacognitive monitoring.