The Gold Medals in an Empty Room: Diagnosing Metalinguistic Reasoning in LLMs with Camlang

Abstract

Large Language Models (LLMs) achieve gold-medal performance across many benchmarks, yet it remains unclear whether this success reflects genuine reasoning or pattern matching. From a cognitive science perspective, an informative test is whether models can master an unfamiliar language through explicit metalinguistic deductive learning, a paradigm in which human learners reliably internalise grammatical systems through metalinguistic reasoning. We address this question with Camlang, a novel constructed language that exhibits naturalistic yet unattested feature combinations. Camlang consists of two explicit resources, a grammar book and a bilingual dictionary, which mirror adult second-language learning via explicit grammar rules and lexical lookup, and which let us disentangle errors in morpho-syntax, lexical semantics, and sentence-level reasoning. Human experiments show that these resources are sufficient for participants to acquire Camlang and successfully solve Camlang tasks. To operationalise evaluation, we adapt CommonsenseQA into Camlang, creating Camlang-CSQA-v0, the first task in a broader suite in which solving questions requires applying grammar rules and lexical mappings. Experimental results show that GPT-5 achieves 98% EM accuracy in English but only 47% in Camlang, far below human performance at 87%, while other state-of-the-art reasoning LLMs perform even worse. Human verification further reveals that most model successes stem from shallow lexical alignment, and that GPT-5 shows emerging metalinguistic awareness to a limited extent but not the systematic grammatical mastery that humans display. Camlang establishes a cognitively grounded evaluation paradigm that exposes fundamental gaps between current models and human metalinguistic competence.

