The Gold Medals in an Empty Room: Diagnosing Metalinguistic Reasoning in LLMs with Camlang

Abstract

Large Language Models (LLMs) achieve gold-medal performance across many benchmarks, yet it remains unclear whether this success reflects genuine reasoning or mere pattern matching. From a cognitive science perspective, an informative test is whether models can master an unfamiliar language through explicit metalinguistic deductive learning, a paradigm in which human learners reliably internalise grammatical systems through metalinguistic reasoning. We address this question with Camlang, a novel constructed language that exhibits naturalistic yet unattested feature combinations. Camlang comes with two explicit resources, a grammar book and a bilingual dictionary, which mirror adult second-language learning via explicit grammar rules and lexical lookup, and which allow us to disentangle errors in morpho-syntax, lexical semantics, and sentence-level reasoning. Human experiments show that these resources are sufficient for participants to acquire Camlang and successfully solve Camlang tasks. To operationalise evaluation, we adapt CommonsenseQA into Camlang, creating Camlang-CSQA-v0, the first task in a broader suite in which solving questions requires applying grammar rules and lexical mappings. Experimental results show that GPT-5 achieves 98% EM accuracy in English but only 47% in Camlang, far below human performance at 87%, while other state-of-the-art reasoning LLMs perform even worse. Human verification further reveals that most model successes stem from shallow lexical alignment: GPT-5 shows emerging metalinguistic awareness to a limited extent, but not the systematic grammatical mastery that humans display. Camlang establishes a cognitively grounded evaluation paradigm that exposes fundamental gaps between current models and human metalinguistic competence.


