Overcoming Vocabulary Mismatch: Vocabulary-agnostic Teacher Guided Language Modeling
March 24, 2025
Authors: Haebin Shin, Lei Ji, Xiao Liu, Yeyun Gong
cs.AI
Abstract
Using large teacher models to guide the training of smaller student models
has become the prevailing paradigm for efficient and effective learning.
However, vocabulary mismatches between teacher and student language models pose
significant challenges in language modeling, resulting in divergent token
sequences and output distributions. To overcome these limitations, we propose
Vocabulary-agnostic Teacher Guided Language Modeling (VocAgnoLM), a novel
approach that bridges the gap caused by vocabulary mismatch through two key
methods: (1) Token-level Lexical Alignment, which aligns token sequences across
mismatched vocabularies, and (2) Teacher Guided Loss, which leverages the
teacher model's loss to guide effective student training. We demonstrate its
effectiveness in language modeling with a 1B student model guided by various 7B
teacher models with different vocabularies. Notably, with
Qwen2.5-Math-Instruct, a teacher model sharing only about 6% of its vocabulary
with TinyLlama, VocAgnoLM achieves a 46% performance improvement over naive
continual pretraining. Furthermore, we demonstrate that VocAgnoLM consistently
benefits from stronger teacher models, providing a robust solution to
vocabulary mismatches in language modeling.
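
The abstract only names the two components, so the sketch below illustrates one plausible realization: aligning tokens from two mismatched vocabularies via their character offsets, then aggregating the teacher's per-token losses onto student positions. The tokenizer usage, the span-overlap heuristic, and the averaging step are assumptions for illustration, not the paper's exact algorithm.

```python
# Minimal sketch (assumptions noted above): character-offset alignment between
# two tokenizers with mismatched vocabularies, plus a simple teacher-loss
# aggregation. Requires Hugging Face fast tokenizers for offset mappings.
from transformers import AutoTokenizer


def align_tokens(text, teacher_tok, student_tok):
    """Map each student token to the teacher tokens whose character spans
    overlap it, so teacher per-token signals can be carried over even when
    the two vocabularies segment the text differently."""
    t_enc = teacher_tok(text, return_offsets_mapping=True, add_special_tokens=False)
    s_enc = student_tok(text, return_offsets_mapping=True, add_special_tokens=False)
    mapping = []
    for s_start, s_end in s_enc["offset_mapping"]:
        overlapped = [
            j for j, (t_start, t_end) in enumerate(t_enc["offset_mapping"])
            if max(s_start, t_start) < min(s_end, t_end)  # character spans overlap
        ]
        mapping.append(overlapped)
    return mapping


def teacher_guided_weights(mapping, teacher_losses):
    """Average the teacher's per-token losses over the teacher tokens aligned
    to each student token. Averaging is one plausible aggregation; the paper's
    Teacher Guided Loss may combine the signals differently."""
    return [
        sum(teacher_losses[j] for j in idxs) / len(idxs) if idxs else 0.0
        for idxs in mapping
    ]


# Example usage (model names are illustrative, matching the paper's setup):
#   teacher_tok = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-Math-7B-Instruct")
#   student_tok = AutoTokenizer.from_pretrained("TinyLlama/TinyLlama-1.1B-Chat-v1.0")
#   mapping = align_tokens("Solve 3x + 5 = 11 for x.", teacher_tok, student_tok)
#   weights = teacher_guided_weights(mapping, teacher_losses)  # from a teacher pass
```

The resulting per-student-token weights could then modulate the student's language-modeling loss, which is one natural way for a teacher's signal to guide training across vocabularies.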