Tokenization Falling Short: The Curse of Tokenization
June 17, 2024
Authors: Yekun Chai, Yewei Fang, Qiwei Peng, Xuhong Li
cs.AI
Abstract
Language models typically tokenize raw text into sequences of subword identifiers drawn from a predefined vocabulary, a process that is inherently sensitive to typographical errors and length variations and largely oblivious to the internal structure of tokens; we term these issues the curse of tokenization. In this study, we delve into these drawbacks and demonstrate that large language models (LLMs) remain susceptible to them. We systematically investigate these challenges and their impact on LLMs through three critical research questions: (1) complex problem solving, (2) token structure probing, and (3) resilience to typographical variation. Our findings reveal that scaling model parameters can mitigate tokenization issues; however, LLMs still suffer from biases induced by typos and other text-format variations. Our experiments show that subword regularization such as BPE-dropout can mitigate this issue. We will release our code and data to facilitate further research.
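To make the two failure modes named in the abstract concrete, the following is a minimal sketch (not the paper's released code) using the Hugging Face `tokenizers` package: it trains a tiny BPE vocabulary on a toy corpus, shows how a single transposition typo can change the subword segmentation of a word, and then encodes with BPE-dropout so that merges are randomly skipped and the same word receives varied segmentations. The corpus, vocabulary size, and dropout rate are illustrative assumptions, not values from the paper.

```python
# Sketch: subword tokenization is brittle to typos, and BPE-dropout
# (random skipping of merges at encoding time) acts as subword regularization.
# Assumes the Hugging Face `tokenizers` package; all settings are illustrative.
from tokenizers import Tokenizer
from tokenizers.models import BPE
from tokenizers.trainers import BpeTrainer
from tokenizers.pre_tokenizers import Whitespace

# Toy corpus, repeated so the trainer learns stable merges.
corpus = [
    "tokenization falls short",
    "the curse of tokenization",
    "language models tokenize raw text into subwords",
] * 100

def build_tokenizer(dropout=None):
    """Train a small BPE tokenizer; `dropout` enables BPE-dropout at encode time."""
    tok = Tokenizer(BPE(unk_token="[UNK]", dropout=dropout))
    tok.pre_tokenizer = Whitespace()
    trainer = BpeTrainer(vocab_size=200, special_tokens=["[UNK]"])
    tok.train_from_iterator(corpus, trainer)
    return tok

# (1) Typo sensitivity: a single transposition typically yields a very
# different subword sequence than the original word.
tok = build_tokenizer()
print(tok.encode("tokenization").tokens)
print(tok.encode("tokeniaztion").tokens)  # transposed "za" -> different splits

# (2) BPE-dropout: with merges randomly dropped, repeated encodings of the
# same word can differ, exposing a model trained on such data to varied splits.
tok_drop = build_tokenizer(dropout=0.3)
for _ in range(3):
    print(tok_drop.encode("tokenization").tokens)
```

Because the dropout is applied when encoding rather than when training the vocabulary, repeated calls on the same input can produce different segmentations; this stochastic exposure to alternative splits is the regularization effect the abstract attributes to BPE-dropout.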