Counting as a minimal probe of language model reliability

Abstract

Large language models perform strongly on benchmarks in mathematical reasoning, coding and document analysis, suggesting a broad ability to follow instructions. However, it remains unclear whether such success reflects general logical competence, repeated application of learned procedures, or pattern matching that mimics rule execution. We investigate this question by introducing Stable Counting Capacity, an assay in which models count repeated symbols until failure. The assay removes knowledge dependencies, semantics and ambiguity from evaluation, avoids lexical and tokenization confounds, and provides a direct measure of procedural reliability beyond standard knowledge-based benchmarks. Here we show, across more than 100 model variants, that stable counting capacity remains far below advertised context limits. Model behavior is consistent neither with open-ended logic nor with stable application of a learned rule, but instead with use of a finite set of count-like internal states, analogous to counting on fingers. Once this resource is exhausted, the appearance of rule following disappears and exact execution collapses into guessing, even with additional test-time compute. These findings show that fluent performance in current language models does not guarantee general, reliable rule following.
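
As an illustration only, the "count repeated symbols until failure" idea might be sketched as follows. Everything here is an assumption rather than the authors' protocol: the prompt wording, the space-separated symbol "a", the number of trials, the rule that capacity is the largest tested length answered exactly on every trial before the first failure, and the names make_prompt, stable_counting_capacity and toy_model, which are hypothetical. The toy model stands in for a real language model call.

```python
from typing import Callable, List


def make_prompt(n: int, symbol: str = "a") -> str:
    """Build a prompt with exactly n copies of `symbol` (assumed format)."""
    header = "Count the symbols and reply with a single integer:"
    return header + "\n" + " ".join([symbol] * n)


def stable_counting_capacity(
    model: Callable[[str], str], lengths: List[int], trials: int = 3
) -> int:
    """Largest tested length answered exactly on every trial.

    Lengths are tried in increasing order and the search stops at the
    first wrong answer (an assumed reading of "until failure").
    """
    capacity = 0
    for n in sorted(lengths):
        prompt = make_prompt(n)
        for _ in range(trials):
            if model(prompt).strip() != str(n):
                return capacity
        capacity = n
    return capacity


def toy_model(prompt: str, limit: int = 20) -> str:
    """Hypothetical stand-in: exact up to `limit` symbols, then a fixed guess."""
    body = prompt.split("\n", 1)[1]
    n = body.split().count("a")
    return str(n if n <= limit else limit)


if __name__ == "__main__":
    print(stable_counting_capacity(toy_model, list(range(1, 51))))
```

With a real model, toy_model would be replaced by a function that sends the prompt to the model and returns its text reply. The exact-match check is deliberately strict, so any deviation from the true count ends the run.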
