TokSuite: Measuring the Impact of Tokenizer Choice on Language Model Behavior
December 23, 2025
Authors: Gül Sena Altıntaş, Malikeh Ehghaghi, Brian Lester, Fengyuan Liu, Wanru Zhao, Marco Ciccone, Colin Raffel
cs.AI
Abstract
Tokenizers provide the fundamental basis through which text is represented and processed by language models (LMs). Despite the importance of tokenization, its role in LM performance and behavior is poorly understood due to the challenge of measuring the impact of tokenization in isolation. To address this need, we present TokSuite, a collection of models and a benchmark that supports research into tokenization's influence on LMs. Specifically, we train fourteen models that use different tokenizers but are otherwise identical using the same architecture, dataset, training budget, and initialization. Additionally, we curate and release a new benchmark that specifically measures model performance subject to real-world perturbations that are likely to influence tokenization. Together, TokSuite allows robust decoupling of the influence of a model's tokenizer, supporting a series of novel findings that elucidate the respective benefits and shortcomings of a wide range of popular tokenizers.