Tokenization Falling Short: The Curse of Tokenization

Abstract

Language models typically tokenize raw text into sequences of subword identifiers from a predefined vocabulary, a process that is inherently sensitive to typographical errors and length variations and largely oblivious to the internal structure of tokens. We term these issues the curse of tokenization. In this study, we examine these drawbacks and demonstrate that large language models (LLMs) remain susceptible to them. We systematically investigate these challenges and their impact on LLMs through three research questions: (1) complex problem solving, (2) token structure probing, and (3) resilience to typographical variation. Our findings show that scaling model parameters can mitigate the tokenization issue; however, LLMs still suffer from biases induced by typos and other text format variations. Our experiments show that subword regularization such as BPE-dropout can mitigate this issue. We will release our code and data to facilitate further research.
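
To make the two mechanisms named in the abstract concrete, the following is a minimal sketch of greedy BPE segmentation with optional BPE-dropout. The merge table `MERGES`, the function name `bpe_encode`, and the example words are assumptions chosen for illustration; they are not taken from the paper's code or experiments.

```python
import random

# Hypothetical toy merge table, ordered by priority (earlier = applied first).
MERGES = [("l", "o"), ("lo", "w"), ("e", "r"), ("low", "er")]


def bpe_encode(word, merges, dropout=0.0, rng=None):
    """Segment `word` with a BPE merge table.

    With dropout > 0, each candidate merge is skipped with that probability
    at every step (BPE-dropout), so the same word can receive different
    segmentations across calls.
    """
    rng = rng or random.Random()
    ranks = {pair: rank for rank, pair in enumerate(merges)}
    symbols = list(word)  # start from single characters
    while len(symbols) > 1:
        # Adjacent pairs that are in the table and survive dropout.
        candidates = [
            (ranks[(a, b)], i)
            for i, (a, b) in enumerate(zip(symbols, symbols[1:]))
            if (a, b) in ranks and rng.random() >= dropout
        ]
        if not candidates:
            break  # nothing left to merge in this step
        _, i = min(candidates)  # highest-priority merge, leftmost on ties
        symbols[i:i + 2] = [symbols[i] + symbols[i + 1]]
    return symbols


clean = bpe_encode("lower", MERGES)   # no dropout: deterministic
typo = bpe_encode("lowre", MERGES)    # transposed letters change the split
noisy = bpe_encode("lower", MERGES, dropout=0.5, rng=random.Random(0))
print(clean, typo, noisy)
```

In this toy setting, a single transposition turns a one-token word into several tokens, which is the kind of sensitivity the abstract describes. Setting `dropout=0.0` recovers plain greedy BPE, and `dropout=1.0` falls back to single characters.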
