TokDrift: When LLM Speaks in Subwords but Code Speaks in Grammar
October 16, 2025
Authors: Yinxi Li, Yuntian Deng, Pengyu Nie
cs.AI
Abstract
Large language models (LLMs) for code rely on subword tokenizers, such as
byte-pair encoding (BPE), learned from mixed natural language text and
programming language code but driven by statistics rather than grammar. As a
result, semantically identical code snippets can be tokenized differently
depending on superficial factors such as whitespace or identifier naming. To
measure the impact of this misalignment, we introduce TokDrift, a framework
that applies semantic-preserving rewrite rules to create code variants
differing only in tokenization. Across nine code LLMs, including large ones
with over 30B parameters, even minor formatting changes can cause substantial
shifts in model behavior. Layer-wise analysis shows that the issue originates
in early embeddings, where subword segmentation fails to capture grammar token
boundaries. Our findings identify misaligned tokenization as a hidden obstacle
to reliable code understanding and generation, highlighting the need for
grammar-aware tokenization for future code LLMs.
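A minimal sketch of the misalignment the abstract describes, not the TokDrift framework itself: two semantically identical statements, differing only in whitespace, are segmented into different subword sequences by a BPE tokenizer. The GPT-2 tokenizer from Hugging Face is used here purely as a stand-in for a code LLM's tokenizer, and the example strings are illustrative, not taken from the paper.

```python
# Illustrative sketch of tokenization drift (assumption: GPT-2 BPE as a
# stand-in tokenizer; the example snippets are not from the paper).
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")

# Two semantics-preserving variants of the same statement: they parse to the
# same grammar tokens (identifiers and operators), differing only in spacing.
variant_a = "total=price*quantity"
variant_b = "total = price * quantity"

tokens_a = tokenizer.tokenize(variant_a)
tokens_b = tokenizer.tokenize(variant_b)

print(tokens_a)
print(tokens_b)
# The grammar-level content is identical, yet the subword sequences typically
# differ, so the model receives two different inputs for the same program.
print("Same subword sequence?", tokens_a == tokens_b)
```

The same kind of check extends naturally to other semantic-preserving rewrites the abstract mentions, such as renaming identifiers, by comparing the tokenizations before and after the rewrite.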