Can LLMs Introspect? A Reality Check

Abstract

Can large language models detect and report their own internal states? Several studies have argued that they can. Drawing on lessons from human metacognition research, we argue that this conclusion may be premature: accepting it requires distinguishing genuine introspection from pattern matching based on surface-level cues. Furthermore, we argue that behavioral evidence alone is inherently insufficient to establish strong introspective claims. We re-examine two recently introduced evaluation paradigms in light of these considerations. In the first paradigm, models are expected to detect whether their internal states have been tampered with. We find that models cannot reliably distinguish such interventions on their internal states from manipulations of the input, suggesting that their success in the original studies reflects an ability to detect anomalies more generally rather than interventions on their internal states in particular. In the second paradigm, models are tasked with predicting labels derived from their own hidden states. Here, we find that classifiers with access only to the input match the performance of the model's own in-context predictions, indicating that the original results do not conclusively demonstrate that the model has privileged access to its internal representations. We further introduce a relabeled control setting, in which models cannot rely on the semantics of the task and must instead rely on their internal representations; models perform closer to chance on this better-controlled version of the task. Taken together, these results indicate that current evidence is insufficient to establish that LLMs display metacognitive monitoring.