Overcoming Vocabulary Mismatch: Vocabulary-agnostic Teacher Guided Language Modeling
March 24, 2025
Authors: Haebin Shin, Lei Ji, Xiao Liu, Yeyun Gong
cs.AI
Abstract
Using large teacher models to guide the training of smaller student models
has become the prevailing paradigm for efficient and effective learning.
However, vocabulary mismatches between teacher and student language models pose
significant challenges in language modeling, resulting in divergent token
sequences and output distributions. To overcome these limitations, we propose
Vocabulary-agnostic Teacher Guided Language Modeling (VocAgnoLM), a novel
approach that bridges the gap caused by vocabulary mismatch through two key
methods: (1) Token-level Lexical Alignment, which aligns token sequences across
mismatched vocabularies, and (2) Teacher Guided Loss, which leverages the
teacher model's loss to guide effective student training. We demonstrate its
effectiveness in language modeling with a 1B student model guided by various 7B
teacher models with different vocabularies. Notably, with
Qwen2.5-Math-Instruct, a teacher model sharing only about 6% of its vocabulary
with TinyLlama, VocAgnoLM achieves a 46% performance improvement over naive
continual pretraining. Furthermore, we demonstrate that VocAgnoLM consistently
benefits from stronger teacher models, providing a robust solution to
vocabulary mismatches in language modeling.
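
The abstract only names the two components, so the sketch below illustrates one plausible realization: aligning tokens from two mismatched vocabularies via their character offsets, then aggregating the teacher's per-token losses onto student positions. The tokenizer usage, the span-overlap heuristic, and the averaging step are assumptions for illustration, not the paper's exact algorithm.

```python
# Minimal sketch (assumptions noted above): character-offset alignment between
# two tokenizers with mismatched vocabularies, plus a simple teacher-loss
# aggregation. Requires Hugging Face fast tokenizers for offset mappings.
from transformers import AutoTokenizer


def align_tokens(text, teacher_tok, student_tok):
    """Map each student token to the teacher tokens whose character spans
    overlap it, so teacher per-token signals can be carried over even when
    the two vocabularies segment the text differently."""
    t_enc = teacher_tok(text, return_offsets_mapping=True, add_special_tokens=False)
    s_enc = student_tok(text, return_offsets_mapping=True, add_special_tokens=False)
    mapping = []
    for s_start, s_end in s_enc["offset_mapping"]:
        overlapped = [
            j for j, (t_start, t_end) in enumerate(t_enc["offset_mapping"])
            if max(s_start, t_start) < min(s_end, t_end)  # character spans overlap
        ]
        mapping.append(overlapped)
    return mapping


def teacher_guided_weights(mapping, teacher_losses):
    """Average the teacher's per-token losses over the teacher tokens aligned
    to each student token. Averaging is one plausible aggregation; the paper's
    Teacher Guided Loss may combine the signals differently."""
    return [
        sum(teacher_losses[j] for j in idxs) / len(idxs) if idxs else 0.0
        for idxs in mapping
    ]


# Example usage (model names are illustrative, matching the paper's setup):
#   teacher_tok = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-Math-7B-Instruct")
#   student_tok = AutoTokenizer.from_pretrained("TinyLlama/TinyLlama-1.1B-Chat-v1.0")
#   mapping = align_tokens("Solve 3x + 5 = 11 for x.", teacher_tok, student_tok)
#   weights = teacher_guided_weights(mapping, teacher_losses)  # from a teacher pass
```

The resulting per-student-token weights could then modulate the student's language-modeling loss, which is one natural way for a teacher's signal to guide training across vocabularies.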