Tokenization Falling Short: The Curse of Tokenization
June 17, 2024
Authors: Yekun Chai, Yewei Fang, Qiwei Peng, Xuhong Li
cs.AI
Abstract
Language models typically tokenize raw text into sequences of subword identifiers drawn from a predefined vocabulary, a process that is inherently sensitive to typographical errors and length variations and largely oblivious to the internal structure of tokens. We term these issues the curse of tokenization. In this study, we delve into these drawbacks and demonstrate that large language models (LLMs) remain susceptible to them. We systematically investigate these challenges and their impact on LLMs through three critical research questions: (1) complex problem solving, (2) token structure probing, and (3) resilience to typographical variation. Our findings reveal that scaling model parameters can mitigate tokenization issues; however, LLMs still suffer from biases induced by typos and other text-format variations. Our experiments show that subword regularization methods such as BPE-dropout can mitigate this issue. We will release our code and data to facilitate further research.
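As a rough illustration (a minimal sketch, not the authors' released code), the snippet below uses the Hugging Face tokenizers library on a toy corpus to show the two effects the abstract describes: a single typo changes a word's subword segmentation, and enabling BPE-dropout makes segmentation stochastic so the same string can be split in multiple ways. The corpus, vocabulary size, and dropout value are illustrative assumptions.

```python
# Sketch only: toy demonstration of tokenization sensitivity and BPE-dropout
# using the Hugging Face `tokenizers` library (not the paper's code/data).

from tokenizers import Tokenizer
from tokenizers.models import BPE
from tokenizers.pre_tokenizers import Whitespace
from tokenizers.trainers import BpeTrainer

# Train a tiny BPE vocabulary on a toy corpus (illustrative only).
corpus = ["tokenization is sensitive to typos",
          "the curse of tokenization"] * 50

# dropout > 0 enables BPE-dropout: merges are randomly skipped at encode time.
tokenizer = Tokenizer(BPE(unk_token="[UNK]", dropout=0.1))
tokenizer.pre_tokenizer = Whitespace()
trainer = BpeTrainer(vocab_size=200, special_tokens=["[UNK]"])
tokenizer.train_from_iterator(corpus, trainer)

# With dropout, repeated encodings of the same string may differ,
# exposing a model trained on this output to multiple segmentations per word.
for _ in range(3):
    print(tokenizer.encode("tokenization").tokens)

# A single typo can change the segmentation entirely, illustrating the
# sensitivity to typographical errors that the abstract calls the
# curse of tokenization.
print(tokenizer.encode("tokeniaztion").tokens)
```

In practice, BPE-dropout of this kind is typically enabled only during training and disabled (dropout = 0) at inference, so that evaluation uses deterministic segmentations.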