Counting as a minimal probe of language model reliability
May 3, 2026
Authors: Tianxiang Dai, Jonathan Fan
cs.AI
Abstract
Large language models perform strongly on benchmarks in mathematical reasoning, coding and document analysis, suggesting a broad ability to follow instructions. However, it remains unclear whether such success reflects general logical competence, repeated application of learned procedures, or pattern matching that mimics rule execution. We investigate this question by introducing Stable Counting Capacity, an assay in which models count repeated symbols until failure. The assay removes knowledge dependencies, semantics and ambiguity from evaluation, avoids lexical and tokenization confounds, and provides a direct measure of procedural reliability beyond standard knowledge-based benchmarks. Here we show, across more than 100 model variants, that stable counting capacity remains far below advertised context limits. Model behavior is consistent neither with open-ended logic nor with stable application of a learned rule, but instead with use of a finite set of count-like internal states, analogous to counting on fingers. Once this resource is exhausted, the appearance of rule following disappears and exact execution collapses into guessing, even with additional test-time compute. These findings show that fluent performance in current language models does not guarantee general, reliable rule following.
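The assay described above admits a simple harness: present a model with a run of repeated symbols, ask for the count, and record the largest run length answered correctly before the first failure. The sketch below illustrates this protocol under stated assumptions; `query_model` is a hypothetical placeholder for a real LLM API call, and the prompt wording and symbol choice are illustrative, not the authors' exact setup.

```python
def make_prompt(symbol: str, n: int) -> str:
    """Build a knowledge-free counting prompt: n repeats of a single symbol."""
    return f"Count the symbols: {symbol * n}"

def query_model(prompt: str) -> int:
    # Placeholder: a real assay would send the prompt to an LLM and
    # parse its numeric answer. This stub counts perfectly, so the
    # harness below never observes a failure.
    return prompt.count("#")

def stable_counting_capacity(max_n: int = 64) -> int:
    """Return the largest run length counted correctly before the first miss."""
    capacity = 0
    for n in range(1, max_n + 1):
        if query_model(make_prompt("#", n)) != n:
            break  # first miscount: the stable regime has ended
        capacity = n
    return capacity
```

With a real model behind `query_model`, the abstract's finding corresponds to `stable_counting_capacity` saturating far below the run lengths that fit within the advertised context window.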