

Overcoming Vocabulary Mismatch: Vocabulary-agnostic Teacher Guided Language Modeling

March 24, 2025
Authors: Haebin Shin, Lei Ji, Xiao Liu, Yeyun Gong
cs.AI

Abstract

Using large teacher models to guide the training of smaller student models has become the prevailing paradigm for efficient and effective learning. However, vocabulary mismatches between teacher and student language models pose significant challenges in language modeling, resulting in divergent token sequences and output distributions. To overcome these limitations, we propose Vocabulary-agnostic Teacher Guided Language Modeling (VocAgnoLM), a novel approach that bridges the gap caused by vocabulary mismatch through two key methods: (1) Token-level Lexical Alignment, which aligns token sequences across mismatched vocabularies, and (2) Teacher Guided Loss, which leverages the teacher model's loss to guide effective student training. We demonstrate its effectiveness in language modeling with a 1B student model guided by various 7B teacher models with different vocabularies. Notably, with Qwen2.5-Math-Instruct, a teacher model sharing only about 6% of its vocabulary with TinyLlama, VocAgnoLM achieves a 46% performance improvement compared to naive continual pretraining. Furthermore, we demonstrate that VocAgnoLM consistently benefits from stronger teacher models, providing a robust solution to vocabulary mismatches in language modeling.
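
To make the two ideas concrete, here is a minimal sketch in Python. It assumes that teacher and student tokens can be aligned through their character spans in the raw text (e.g., via tokenizer offset mappings), and that the teacher's per-token loss is used to reweight the student's per-token loss. The function names and the exact weighting scheme are illustrative assumptions, one plausible reading of the abstract, not the paper's actual implementation.

```python
# Sketch of "Token-level Lexical Alignment" + "Teacher Guided Loss".
# Assumption: both tokenizers provide (char_start, char_end) offsets
# per token over the same raw text, so spans can be matched directly.

import torch

def align_by_char_span(student_offsets, teacher_offsets):
    """For each student token, collect the teacher token indices whose
    character spans overlap it (alignment across mismatched vocabularies)."""
    mapping = []
    for s_start, s_end in student_offsets:
        overlaps = [
            t_idx
            for t_idx, (t_start, t_end) in enumerate(teacher_offsets)
            if max(s_start, t_start) < min(s_end, t_end)
        ]
        mapping.append(overlaps)
    return mapping

def teacher_guided_loss(student_losses, teacher_losses, mapping):
    """Weight each student token's loss by the mean loss of its aligned
    teacher tokens, so tokens the teacher finds hard get more emphasis.
    (Hypothetical weighting; the paper's formulation may differ.)"""
    weights = torch.ones_like(student_losses)
    for s_idx, t_idxs in enumerate(mapping):
        if t_idxs:
            weights[s_idx] = teacher_losses[t_idxs].mean()
    weights = weights / weights.sum() * len(weights)  # normalize to mean 1
    return (weights * student_losses).mean()
```

In this reading, student tokens with no aligned teacher token keep a neutral weight of 1, and the normalization keeps the overall loss scale comparable to unweighted training.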
