Tokenization Constraints in LLMs: A Study of Symbolic and Arithmetic Reasoning Limits
May 20, 2025
Authors: Xiang Zhang, Juntai Cao, Jiaqi Wei, Yiwei Xu, Chenyu You
cs.AI
Abstract
Tokenization is the first - and often underappreciated - layer of computation
in language models. While Chain-of-Thought (CoT) prompting enables transformer
models to approximate recurrent computation by externalizing intermediate
steps, we show that the success of such reasoning is fundamentally bounded by
the structure of tokenized inputs. This work presents a theoretical and
empirical investigation into how tokenization schemes, particularly
subword-based methods like byte-pair encoding (BPE), impede symbolic
computation by merging or obscuring atomic reasoning units. We introduce the
notion of Token Awareness to formalize how poor token granularity disrupts
logical alignment and prevents models from generalizing symbolic procedures.
Through systematic evaluation on arithmetic and symbolic tasks, we demonstrate
that token structure dramatically affects reasoning performance, causing failure
even with CoT, while atomically-aligned formats unlock strong generalization,
allowing small models (e.g., GPT-4o-mini) to outperform larger systems (e.g.,
o1) in structured reasoning. Our findings reveal that symbolic reasoning
ability in LLMs is not purely architectural, but deeply conditioned on
token-level representations.
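To make the granularity issue concrete, here is a minimal sketch (ours, not the authors') that uses the tiktoken library to compare how a BPE tokenizer segments a merged digit string versus a space-delimited, atomically aligned one; the encoding name and the helper function are illustrative assumptions, not details taken from the paper.

```python
# Minimal illustration of token granularity: a BPE tokenizer may merge several
# digits into one token, while an explicitly delimited format keeps each digit
# as its own token. Assumes the `tiktoken` package is installed; the encoding
# name is an illustrative choice, not one used in the paper.
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")

def show_tokens(text: str) -> None:
    # Print each BPE token as the substring of the input it covers.
    ids = enc.encode(text)
    pieces = [enc.decode([i]) for i in ids]
    print(f"{text!r:28} -> {pieces}")

# Merged format: digits can be grouped into multi-digit tokens.
show_tokens("123456 + 789")
# Atomically aligned format: delimiters keep each digit separate.
show_tokens("1 2 3 4 5 6 + 7 8 9")
```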