Tokenization Constraints in LLMs: A Study of Symbolic and Arithmetic Reasoning Limits
May 20, 2025
Authors: Xiang Zhang, Juntai Cao, Jiaqi Wei, Yiwei Xu, Chenyu You
cs.AI
Abstract
Tokenization is the first, and often underappreciated, layer of computation
in language models. While Chain-of-Thought (CoT) prompting enables transformer
models to approximate recurrent computation by externalizing intermediate
steps, we show that the success of such reasoning is fundamentally bounded by
the structure of tokenized inputs. This work presents a theoretical and
empirical investigation into how tokenization schemes, particularly
subword-based methods like byte-pair encoding (BPE), impede symbolic
computation by merging or obscuring atomic reasoning units. We introduce the
notion of Token Awareness to formalize how poor token granularity disrupts
logical alignment and prevents models from generalizing symbolic procedures.
Through systematic evaluation on arithmetic and symbolic tasks, we demonstrate
that token structure dramatically affects reasoning performance, causing failure
even with CoT, while atomically-aligned formats unlock strong generalization,
allowing small models (e.g., GPT-4o-mini) to outperform larger systems (e.g.,
o1) in structured reasoning. Our findings reveal that symbolic reasoning
ability in LLMs is not purely architectural, but deeply conditioned on
token-level representations.
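To make the failure mode concrete, the sketch below (not code from the paper; the merge table, function names, and example values are invented for illustration) contrasts a toy greedy, BPE-style segmentation that fuses digits into multi-digit chunks with an atomically-aligned, one-digit-per-token format of the kind the abstract describes.

```python
# Illustrative sketch only: a toy greedy, BPE-style segmenter vs. an
# atomically-aligned (digit-level) split. The merge table is invented and is
# not the vocabulary of any real tokenizer.

from typing import List

# Hypothetical multi-digit chunks that a subword tokenizer might have learned.
MERGES = ["123", "45", "67", "89", "00", "12"]

def bpe_like_tokenize(text: str) -> List[str]:
    """Greedy longest-match segmentation over the toy merge table,
    falling back to single characters; mimics how BPE can fuse digits."""
    tokens, i = [], 0
    while i < len(text):
        for merge in sorted(MERGES, key=len, reverse=True):
            if text.startswith(merge, i):
                tokens.append(merge)
                i += len(merge)
                break
        else:
            tokens.append(text[i])
            i += 1
    return tokens

def atomic_tokenize(text: str) -> List[str]:
    """Atomically-aligned format: every digit is its own reasoning unit."""
    return list(text)

def atomic_prompt(a: int, b: int) -> str:
    """Space-separating digits is one simple way to force digit-level tokens."""
    return f"{' '.join(str(a))} + {' '.join(str(b))} ="

if __name__ == "__main__":
    number = "1234567"
    print("BPE-like:", bpe_like_tokenize(number))   # digits fused into chunks
    print("Atomic  :", atomic_tokenize(number))     # one digit per token
    print("Prompt  :", atomic_prompt(1234, 567))    # '1 2 3 4 + 5 6 7 ='
</antml>
```

Under the toy merge table, "1234567" segments into chunks such as ['123', '45', '67'], so no single token corresponds to a single digit, whereas the atomic format keeps each digit addressable as its own reasoning unit during step-by-step computation.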