The Gold Medals in an Empty Room: Diagnosing Metalinguistic Reasoning in LLMs with Camlang
August 30, 2025
Authors: Fenghua Liu, Yulong Chen, Yixuan Liu, Zhujun Jin, Solomon Tsai, Ming Zhong
cs.AI
Abstract
Large Language Models (LLMs) achieve gold-medal performance across many
benchmarks, yet it remains unclear whether such success reflects genuine
reasoning or pattern matching. From a cognitive science perspective, an
informative test is whether models can master an unfamiliar language through
explicit metalinguistic deductive learning, a paradigm where human learners can
reliably internalise grammatical systems through metalinguistic reasoning. We
address this question with Camlang, a novel constructed language that exhibits
naturalistic yet unattested feature combinations. Camlang consists of two
explicit resources, a grammar book and a bilingual dictionary, which mirror
adult second-language learning via explicit grammar rules and lexical lookup,
and enable us to disentangle errors in morpho-syntax, lexical semantics, and
sentence-level reasoning. Human experiments show that these resources are
sufficient for participants to acquire Camlang and successfully solve Camlang
tasks. To operationalise evaluation, we adapt CommonsenseQA into Camlang,
creating Camlang-CSQA-v0, the first task in a broader suite where solving
questions requires applying grammar rules and lexical mappings. Experimental
results show that GPT-5 achieves 98% exact-match (EM) accuracy in English but
only 47% in Camlang, far below human performance at 87%, while other
state-of-the-art reasoning LLMs perform even worse. Human verification further
reveals that most model successes stem from shallow lexical alignment; GPT-5
shows emerging metalinguistic awareness to a limited extent, but not the
systematic grammatical mastery that humans achieve. Camlang establishes a
cognitively grounded evaluation
paradigm that exposes fundamental gaps between current models and human
metalinguistic competence.