Cross-Tokenizer LLM Distillation through a Byte-Level Interface

Abstract

Cross-tokenizer distillation (CTD), the transfer of knowledge from a teacher to a student language model when the two use different tokenizers, remains a largely unsolved problem. Existing approaches rely on heuristic strategies to align mismatched vocabularies, which introduces considerable complexity. In this paper, we propose a simple but effective baseline, Byte-Level Distillation (BLD), which enables CTD by operating at an interface common to all tokenizers: the byte level. Specifically, we convert the teacher's output distribution to byte-level probabilities, attach a lightweight byte-level decoder head to the student, and distill through this shared byte-level interface. Despite its simplicity, BLD performs competitively with, and on several benchmarks surpasses, significantly more sophisticated CTD methods across a range of distillation tasks with models from 1B to 8B parameters. Our results suggest that the byte level is a natural common ground for cross-tokenizer knowledge transfer. They also show that consistent improvements across all tasks and benchmarks remain elusive, underscoring that CTD is still an open problem.
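To make the idea of converting a token-level distribution into byte-level probabilities concrete, here is a minimal sketch. It is not the paper's procedure, which the abstract does not specify. It only assumes that the probability of the next byte can be approximated by summing the probabilities of all tokens whose byte encoding starts with that byte. The function name `next_byte_distribution` and the toy vocabulary are hypothetical, chosen for illustration only.

```python
from collections import defaultdict


def next_byte_distribution(token_probs, vocab_bytes):
    """Marginalize a next-token distribution onto the first byte of each token.

    token_probs: dict mapping token id -> probability (hypothetical teacher output)
    vocab_bytes: dict mapping token id -> bytes encoding of that token
    Assumption: tokens with empty byte content (e.g. special tokens) are skipped,
    and the remaining mass is renormalized.
    """
    byte_probs = defaultdict(float)
    for token_id, p in token_probs.items():
        token = vocab_bytes[token_id]
        if not token:
            continue  # no byte content, nothing to attribute
        byte_probs[token[0]] += p  # token[0] is an int in 0..255
    total = sum(byte_probs.values())
    return {b: p / total for b, p in byte_probs.items()}


# Toy vocabulary: two tokens start with "t", one with "a".
vocab = {0: b"the", 1: b"to", 2: b"a"}
probs = {0: 0.5, 1: 0.25, 2: 0.25}
print(next_byte_distribution(probs, vocab))  # {116: 0.75, 97: 0.25}
```

This sketch covers only the first byte after a token boundary. Obtaining probabilities for later bytes inside a token, and training the student's byte-level head against them, requires additional machinery that is not described in this abstract.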
