Cross-Tokenizer LLM Distillation through a Byte-Level Interface
April 13, 2026
Authors: Avyav Kumar Singh, Yen-Chen Wu, Alexandru Cioba, Alberto Bernacchia, Davide Buffelli
cs.AI
Abstract
Cross-tokenizer distillation (CTD), the transfer of knowledge from a teacher to a student language model when the two use different tokenizers, remains a largely unsolved problem. Existing approaches rely on heuristic strategies to align mismatched vocabularies, introducing considerable complexity. In this paper, we propose a simple but effective baseline called Byte-Level Distillation (BLD) which enables CTD by operating at a common interface across tokenizers: the byte level. In more detail, we convert the teacher's output distribution to byte-level probabilities, attach a lightweight byte-level decoder head to the student, and distill through this shared byte-level interface. Despite its simplicity, BLD performs competitively with, and on several benchmarks surpasses, significantly more sophisticated CTD methods, across a range of distillation tasks with models from 1B to 8B parameters. Our results suggest that the byte level is a natural common ground for cross-tokenizer knowledge transfer, while also highlighting that consistent improvements across all tasks and benchmarks remain elusive, underscoring that CTD is still an open problem.
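The abstract does not spell out how the teacher's token-level distribution becomes a byte-level one, but the underlying idea can be illustrated: every token decodes to a byte sequence, so a next-token distribution induces a next-byte distribution by summing the probability of all tokens whose byte encoding starts with a given byte. The sketch below is a hypothetical illustration of that marginalization only; the function name, the toy vocabulary, and the restriction to the first byte are our assumptions, not the paper's implementation.

```python
# Hypothetical sketch (not the paper's code): marginalizing a teacher's
# next-token distribution into a distribution over the first next byte.
from collections import defaultdict


def next_byte_distribution(token_probs: dict[str, float]) -> dict[int, float]:
    """Sum token probabilities by the leading byte of each token's UTF-8 encoding."""
    byte_probs: defaultdict[int, float] = defaultdict(float)
    for token, p in token_probs.items():
        first_byte = token.encode("utf-8")[0]  # leading byte of this token
        byte_probs[first_byte] += p
    return dict(byte_probs)


# Toy teacher distribution over a tiny illustrative vocabulary.
teacher = {"the": 0.5, "they": 0.2, "a": 0.3}
byte_dist = next_byte_distribution(teacher)
# "the" and "they" share the leading byte b"t", so their mass is merged
# under byte 0x74, while "a" keeps its mass under byte 0x61.
```

In a full distillation setup, a student's byte-level decoder head would be trained to match such byte-level targets (e.g. with a KL-divergence loss), which is what lets teacher and student communicate despite having different tokenizers.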