Tokenization Falling Short: The Curse of Tokenization
June 17, 2024
Authors: Yekun Chai, Yewei Fang, Qiwei Peng, Xuhong Li
cs.AI
Abstract
Language models typically tokenize raw text into sequences of subword identifiers drawn from a predefined vocabulary, a process that is inherently sensitive to typographical errors and length variations and largely oblivious to the internal structure of tokens; we term these issues the curse of tokenization. In this study, we delve into these drawbacks and demonstrate that large language models (LLMs) remain susceptible to them. We systematically investigate these challenges and their impact on LLMs through three critical research questions: (1) complex problem solving, (2) token structure probing, and (3) resilience to typographical variation. Our findings reveal that scaling model parameters can mitigate tokenization issues; however, LLMs still suffer from biases induced by typos and other text-format variations. Our experiments show that subword regularization such as BPE-dropout can mitigate this issue. We will release our code and data to facilitate further research.
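To make the two failure modes named in the abstract concrete, the following is a minimal sketch (not the paper's released code) using the Hugging Face `tokenizers` package: it trains a tiny BPE vocabulary on a toy corpus, shows how a single transposition typo can change the subword segmentation of a word, and then encodes with BPE-dropout so that merges are randomly skipped and the same word receives varied segmentations. The corpus, vocabulary size, and dropout rate are illustrative assumptions, not values from the paper.

```python
# Sketch: subword tokenization is brittle to typos, and BPE-dropout
# (random skipping of merges at encoding time) acts as subword regularization.
# Assumes the Hugging Face `tokenizers` package; all settings are illustrative.
from tokenizers import Tokenizer
from tokenizers.models import BPE
from tokenizers.trainers import BpeTrainer
from tokenizers.pre_tokenizers import Whitespace

# Toy corpus, repeated so the trainer learns stable merges.
corpus = [
    "tokenization falls short",
    "the curse of tokenization",
    "language models tokenize raw text into subwords",
] * 100

def build_tokenizer(dropout=None):
    """Train a small BPE tokenizer; `dropout` enables BPE-dropout at encode time."""
    tok = Tokenizer(BPE(unk_token="[UNK]", dropout=dropout))
    tok.pre_tokenizer = Whitespace()
    trainer = BpeTrainer(vocab_size=200, special_tokens=["[UNK]"])
    tok.train_from_iterator(corpus, trainer)
    return tok

# (1) Typo sensitivity: a single transposition typically yields a very
# different subword sequence than the original word.
tok = build_tokenizer()
print(tok.encode("tokenization").tokens)
print(tok.encode("tokeniaztion").tokens)  # transposed "za" -> different splits

# (2) BPE-dropout: with merges randomly dropped, repeated encodings of the
# same word can differ, exposing a model trained on such data to varied splits.
tok_drop = build_tokenizer(dropout=0.3)
for _ in range(3):
    print(tok_drop.encode("tokenization").tokens)
```

Because the dropout is applied when encoding rather than when training the vocabulary, repeated calls on the same input can produce different segmentations; this stochastic exposure to alternative splits is the regularization effect the abstract attributes to BPE-dropout.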