TokDrift: When LLM Speaks in Subwords but Code Speaks in Grammar

Abstract

Large language models (LLMs) for code rely on subword tokenizers, such as byte-pair encoding (BPE), learned from mixed natural language text and programming language code but driven by statistics rather than grammar. As a result, semantically identical code snippets can be tokenized differently depending on superficial factors such as whitespace or identifier naming. To measure the impact of this misalignment, we introduce TokDrift, a framework that applies semantic-preserving rewrite rules to create code variants differing only in tokenization. Across nine code LLMs, including large ones with over 30B parameters, even minor formatting changes can cause substantial shifts in model behavior. Layer-wise analysis shows that the issue originates in early embeddings, where subword segmentation fails to capture grammar token boundaries. Our findings identify misaligned tokenization as a hidden obstacle to reliable code understanding and generation, highlighting the need for grammar-aware tokenization for future code LLMs.
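The mismatch between subword boundaries and grammar token boundaries can be illustrated with a small sketch. This is not the TokDrift framework and not any real model's tokenizer: `TOY_VOCAB` and `toy_subword_tokenize` are hypothetical stand-ins (a greedy longest-match segmenter over a hand-picked vocabulary), chosen only to mimic how a statistically learned vocabulary might behave. Python's standard `tokenize` module plays the role of the grammar-level lexer.

```python
import io
import tokenize

# Hypothetical toy vocabulary; real BPE vocabularies are learned from data.
TOY_VOCAB = {"x", "y", "+", "=", "1", "x+", " =", " x", " +", " 1"}


def toy_subword_tokenize(text, vocab=TOY_VOCAB):
    """Greedy longest-match segmentation over a fixed vocabulary.

    Characters not covered by the vocabulary fall back to
    single-character tokens.
    """
    tokens = []
    max_len = max(len(piece) for piece in vocab)
    i = 0
    while i < len(text):
        for size in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + size]
            if size == 1 or piece in vocab:
                tokens.append(piece)
                i += size
                break
    return tokens


def grammar_tokens(code):
    """Lexer-level tokens of Python code, ignoring layout-only tokens."""
    layout = {
        tokenize.NEWLINE, tokenize.NL, tokenize.ENDMARKER,
        tokenize.INDENT, tokenize.DEDENT, tokenize.COMMENT,
    }
    reader = io.StringIO(code).readline
    return [t.string for t in tokenize.generate_tokens(reader)
            if t.type not in layout]


compact = "y=x+1"      # original snippet
spaced = "y = x + 1"   # semantics-preserving whitespace variant

# The grammar sees the same token sequence for both variants.
same_grammar = grammar_tokens(compact) == grammar_tokens(spaced)

# The toy subword segmenter does not: "x+" even straddles a grammar boundary.
subword_compact = toy_subword_tokenize(compact)
subword_spaced = toy_subword_tokenize(spaced)

print(same_grammar)
print(subword_compact)
print(subword_spaced)
```

In this toy setting the two variants are identical to the lexer but reach the model as different token sequences, and the piece `x+` merges an identifier with an operator, which is the kind of misalignment the abstract describes.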
