Can LLMs Introspect? A Reality Check

Abstract

Can large language models (LLMs) detect and report their own internal states? Several studies have argued that they can. Drawing on lessons from human metacognition research, we argue that this conclusion may be premature: before accepting it, we must distinguish genuine introspection from pattern matching on surface-level cues. We further argue that behavioral evidence alone is inherently insufficient to establish strong introspective claims. With this in mind, we re-examine two recently introduced evaluation paradigms. In the first, models are expected to detect whether their internal states have been tampered with. We find that models cannot reliably distinguish such interventions from manipulations of the input, which suggests that their success in the original studies reflects a general ability to detect anomalies rather than sensitivity to interventions on their internal states in particular. In the second, models are tasked with predicting labels derived from their own hidden states. Here, classifiers with access only to the input perform on par with the model's own in-context predictions, indicating that the original results do not conclusively demonstrate privileged access to internal representations. We also introduce a relabeled control setting in which models cannot rely on the semantics of the task and must instead rely on the internal representation; on this better-controlled version of the task, models perform closer to chance. Taken together, these results indicate that current evidence is insufficient to establish that LLMs display metacognitive monitoring.
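The logic of the two controls for the second paradigm can be illustrated with a minimal sketch. This is not the authors' implementation: the helper names (`relabel`, `fit_input_only`, `accuracy`, `majority_baseline`), the token-voting classifier, and the toy data are all hypothetical and chosen only to show the shape of the comparison. The idea is that a classifier trained on the input text alone gives a reference accuracy; if the model's in-context predictions do not beat it, the result does not demonstrate privileged access. Relabeling replaces meaningful label names with arbitrary tokens so that task semantics cannot be exploited.

```python
import random
from collections import Counter, defaultdict


def relabel(labels, seed=0):
    """Replace each distinct label with an arbitrary token via a random bijection.

    Hypothetical stand-in for a relabeled control: class structure is kept,
    but the label names no longer carry any task semantics.
    """
    classes = sorted(set(labels))
    tokens = [f"L{i}" for i in range(len(classes))]
    random.Random(seed).shuffle(tokens)
    mapping = dict(zip(classes, tokens))
    return [mapping[y] for y in labels], mapping


def accuracy(preds, gold):
    """Fraction of predictions that match the gold labels."""
    return sum(p == g for p, g in zip(preds, gold)) / len(gold)


def majority_baseline(gold):
    """Accuracy obtained by always predicting the most frequent label."""
    return Counter(gold).most_common(1)[0][1] / len(gold)


def fit_input_only(texts, labels):
    """Fit a toy classifier that sees only the input text, never hidden states.

    Each token votes with the label counts it was seen with during training.
    """
    counts = defaultdict(Counter)
    for text, y in zip(texts, labels):
        for tok in text.lower().split():
            counts[tok][y] += 1
    fallback = Counter(labels).most_common(1)[0][0]

    def predict(text):
        votes = Counter()
        for tok in text.lower().split():
            votes.update(counts.get(tok, {}))
        return votes.most_common(1)[0][0] if votes else fallback

    return predict


# Toy data (hypothetical): the labels stand in for labels derived from hidden states.
train_texts = ["the movie was great", "great food",
               "terrible service", "the plot was terrible"]
train_labels = ["pos", "pos", "neg", "neg"]
test_texts = ["great plot", "terrible food"]
test_labels = ["pos", "neg"]

predict = fit_input_only(train_texts, train_labels)
input_only_acc = accuracy([predict(t) for t in test_texts], test_labels)
print("input-only accuracy:", input_only_acc)
print("majority baseline:", majority_baseline(test_labels))
print("relabeled training labels:", relabel(train_labels, seed=1)[0])
```

In an actual evaluation, the model's in-context accuracy on the same test items would be compared against `input_only_acc` and against the majority baseline, once with the original labels and once with the relabeled ones.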