

Cross-Tokenizer LLM Distillation through a Byte-Level Interface

April 13, 2026
Authors: Avyav Kumar Singh, Yen-Chen Wu, Alexandru Cioba, Alberto Bernacchia, Davide Buffelli
cs.AI

Abstract

Cross-tokenizer distillation (CTD), the transfer of knowledge from a teacher to a student language model when the two use different tokenizers, remains a largely unsolved problem. Existing approaches rely on heuristic strategies to align mismatched vocabularies, introducing considerable complexity. In this paper, we propose a simple but effective baseline called Byte-Level Distillation (BLD) which enables CTD by operating at a common interface across tokenizers: the byte level. In more detail, we convert the teacher's output distribution to byte-level probabilities, attach a lightweight byte-level decoder head to the student, and distill through this shared byte-level interface. Despite its simplicity, BLD performs competitively with--and on several benchmarks surpasses--significantly more sophisticated CTD methods, across a range of distillation tasks with models from 1B to 8B parameters. Our results suggest that the byte level is a natural common ground for cross-tokenizer knowledge transfer, while also highlighting that consistent improvements across all tasks and benchmarks remain elusive, underscoring that CTD is still an open problem.
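The core idea of the byte-level interface can be illustrated with a minimal sketch: because every tokenizer ultimately emits UTF-8 bytes, a teacher's next-token distribution can be collapsed onto bytes by summing the probability of all tokens that begin with the same byte, giving a 256-way distribution that any student with a byte-level head can be matched against. The function names below are hypothetical, and the sketch only marginalizes over the first byte of each token; the paper's actual conversion over full byte sequences is more involved.

```python
import math

def first_byte_distribution(token_probs, token_bytes):
    """Collapse a next-token distribution onto the first byte of each token.

    token_probs: dict mapping token id -> probability (sums to 1)
    token_bytes: dict mapping token id -> the token's UTF-8 byte string
    Returns a list of 256 probabilities, one per possible first byte.
    """
    byte_probs = [0.0] * 256
    for tok, p in token_probs.items():
        first_byte = token_bytes[tok][0]  # indexing bytes gives an int in 0..255
        byte_probs[first_byte] += p
    return byte_probs

def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) between two byte distributions; a typical distillation loss."""
    return sum(pi * math.log((pi + eps) / (qi + eps))
               for pi, qi in zip(p, q) if pi > 0)

# Toy teacher vocabulary: three tokens with distinct tokenizations.
teacher_bytes = {0: b"the", 1: b"th", 2: b"a"}
teacher_probs = {0: 0.5, 1: 0.3, 2: 0.2}

byte_dist = first_byte_distribution(teacher_probs, teacher_bytes)
# Mass on byte 't' aggregates "the" and "th"; the student's byte head
# would be trained to minimize kl_divergence(byte_dist, student_byte_dist).
```

Note that two tokenizers with entirely different vocabularies produce comparable 256-way distributions under this mapping, which is what removes the need for heuristic vocabulary alignment.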