TokDrift: When LLM Speaks in Subwords but Code Speaks in Grammar

Abstract

Large language models (LLMs) for code rely on subword tokenizers, such as byte-pair encoding (BPE), learned from mixed natural language text and programming language code but driven by statistics rather than grammar. As a result, semantically identical code snippets can be tokenized differently depending on superficial factors such as whitespace or identifier naming. To measure the impact of this misalignment, we introduce TokDrift, a framework that applies semantic-preserving rewrite rules to create code variants differing only in tokenization. Across nine code LLMs, including large ones with over 30B parameters, even minor formatting changes can cause substantial shifts in model behavior. Layer-wise analysis shows that the issue originates in early embeddings, where subword segmentation fails to capture grammar token boundaries. Our findings identify misaligned tokenization as a hidden obstacle to reliable code understanding and generation, highlighting the need for grammar-aware tokenization for future code LLMs.
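To make the misalignment concrete, the following is a minimal sketch, not the TokDrift implementation. It compares lexer-level tokens from Python's standard `tokenize` module with the output of a toy greedy longest-match segmenter. The vocabulary `TOY_VOCAB` and the helper names `subword_tokens` and `grammar_tokens` are hypothetical and chosen only for illustration; real BPE vocabularies are learned from data and will segment differently.

```python
import io
import tokenize

# Hypothetical toy vocabulary (an assumption for illustration only).
# Real subword vocabularies are learned from corpus statistics.
TOY_VOCAB = {"x", "a", "b", "=", "+", " ", " =", " +", "=a", "+b", " a", " b"}


def subword_tokens(text, vocab=TOY_VOCAB):
    """Greedy longest-match segmentation, a stand-in for a learned subword tokenizer."""
    tokens, i = [], 0
    max_len = max(len(v) for v in vocab)
    while i < len(text):
        # Try the longest candidate first; fall back to a single character.
        for n in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + n]
            if piece in vocab or n == 1:
                tokens.append(piece)
                i += n
                break
    return tokens


def grammar_tokens(source):
    """Lexer-level tokens from Python's own tokenizer, ignoring layout-only tokens."""
    skip = {
        tokenize.NEWLINE, tokenize.NL, tokenize.ENDMARKER,
        tokenize.INDENT, tokenize.DEDENT, tokenize.COMMENT,
    }
    readline = io.StringIO(source).readline
    return [t.string for t in tokenize.generate_tokens(readline) if t.type not in skip]


compact = "x=a+b"
spaced = "x = a + b"

# Same grammar tokens for both spellings of the statement.
print(grammar_tokens(compact), grammar_tokens(spaced))
# Different subword sequences; in the compact form, "=a" straddles
# the boundary between the operator and the identifier.
print(subword_tokens(compact), subword_tokens(spaced))
```

Under this toy vocabulary, the two spellings share one grammar-token sequence but produce two different subword sequences, which is the kind of formatting-only difference the abstract describes.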
