TokSuite: Measuring the Impact of Tokenizer Choice on Language Model Behavior
December 23, 2025
Authors: Gül Sena Altıntaş, Malikeh Ehghaghi, Brian Lester, Fengyuan Liu, Wanru Zhao, Marco Ciccone, Colin Raffel
cs.AI
Abstract
Tokenizers provide the fundamental basis through which text is represented and processed by language models (LMs). Despite the importance of tokenization, its role in LM performance and behavior is poorly understood due to the challenge of measuring the impact of tokenization in isolation. To address this need, we present TokSuite, a collection of models and a benchmark that supports research into tokenization's influence on LMs. Specifically, we train fourteen models that use different tokenizers but are otherwise identical using the same architecture, dataset, training budget, and initialization. Additionally, we curate and release a new benchmark that specifically measures model performance subject to real-world perturbations that are likely to influence tokenization. Together, TokSuite allows robust decoupling of the influence of a model's tokenizer, supporting a series of novel findings that elucidate the respective benefits and shortcomings of a wide range of popular tokenizers.
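To illustrate the kind of real-world perturbation the benchmark targets, here is a minimal, hypothetical sketch (not part of TokSuite) of how a small surface change — an accented character — can alter the token sequence a model sees under one tokenization scheme but not another. The two toy tokenizers below are illustrative stand-ins, not the tokenizers studied in the paper.

```python
# Toy tokenizers to show that tokenization granularity determines
# whether a surface perturbation changes the model's input sequence.

def whitespace_tokenize(text):
    """Split on whitespace; ignores character-level changes inside words."""
    return text.split()

def byte_tokenize(text):
    """UTF-8 byte-level tokens, as used by byte-level tokenizers."""
    return [f"<0x{b:02X}>" for b in text.encode("utf-8")]

clean = "cafe visit"
perturbed = "café visit"  # accented character, a common real-world variation

# Whitespace view: both inputs have the same number of tokens.
assert len(whitespace_tokenize(clean)) == len(whitespace_tokenize(perturbed)) == 2

# Byte-level view: "é" encodes to two UTF-8 bytes, so the sequence grows.
print(len(byte_tokenize(clean)), len(byte_tokenize(perturbed)))  # 10 11
```

Real subword tokenizers (BPE, unigram, etc.) fall between these extremes, which is why isolating tokenizer effects requires otherwise-identical models as in TokSuite.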