
TokDrift: When LLM Speaks in Subwords but Code Speaks in Grammar

October 16, 2025
作者: Yinxi Li, Yuntian Deng, Pengyu Nie
cs.AI

Abstract

Large language models (LLMs) for code rely on subword tokenizers, such as byte-pair encoding (BPE), learned from mixed natural language text and programming language code but driven by statistics rather than grammar. As a result, semantically identical code snippets can be tokenized differently depending on superficial factors such as whitespace or identifier naming. To measure the impact of this misalignment, we introduce TokDrift, a framework that applies semantic-preserving rewrite rules to create code variants differing only in tokenization. Across nine code LLMs, including large ones with over 30B parameters, even minor formatting changes can cause substantial shifts in model behavior. Layer-wise analysis shows that the issue originates in early embeddings, where subword segmentation fails to capture grammar token boundaries. Our findings identify misaligned tokenization as a hidden obstacle to reliable code understanding and generation, highlighting the need for grammar-aware tokenization for future code LLMs.
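
To make the misalignment concrete, the following is a minimal sketch, not the authors' TokDrift implementation: it applies one hypothetical semantic-preserving rewrite (renaming an identifier and adjusting whitespace) and compares the resulting BPE tokenizations. It assumes the tiktoken package and its cl100k_base encoding purely for illustration; any subword tokenizer would show the same effect.

    # Minimal illustration of tokenization drift under a semantic-preserving rewrite.
    # Assumes tiktoken (pip install tiktoken); not the paper's actual framework.
    import tiktoken

    enc = tiktoken.get_encoding("cl100k_base")

    original = "def getValue(x):\n    return x+1\n"
    # Semantically identical variant: snake_case identifier, spaces around '+'.
    variant = "def get_value(x):\n    return x + 1\n"

    for label, snippet in [("original", original), ("variant", variant)]:
        token_ids = enc.encode(snippet)
        pieces = [enc.decode([t]) for t in token_ids]
        print(f"{label}: {len(token_ids)} tokens -> {pieces}")

Both snippets compute the same result when run, yet their subword segmentations differ, so a code LLM receives two different input sequences; TokDrift measures how much model behavior shifts across such variants.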