Overcoming Vocabulary Mismatch: Vocabulary-agnostic Teacher Guided Language Modeling

Abstract

Using large teacher models to guide the training of smaller student models has become the prevailing paradigm for efficient and effective learning. However, vocabulary mismatches between teacher and student language models pose a significant challenge in language modeling, because the two models produce divergent token sequences and output distributions for the same text. To overcome this limitation, we propose Vocabulary-agnostic Teacher Guided Language Modeling (VocAgnoLM), a novel approach that bridges the gap caused by vocabulary mismatch through two key methods: (1) Token-level Lexical Alignment, which aligns token sequences across mismatched vocabularies, and (2) Teacher Guided Loss, which leverages the teacher model's loss to guide effective student training. We demonstrate its effectiveness in language modeling with a 1B student model using various 7B teacher models with different vocabularies. Notably, with Qwen2.5-Math-Instruct, a teacher model that shares only about 6% of its vocabulary with TinyLlama, VocAgnoLM achieves a 46% performance improvement over naive continual pretraining. Furthermore, we demonstrate that VocAgnoLM consistently benefits from stronger teacher models, providing a robust solution to vocabulary mismatches in language modeling.
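The abstract does not give the details of either method, so the following is only a minimal sketch of how the two ideas could fit together, not the authors' implementation. It assumes that both tokenizations concatenate back to the same text and that all tokens are non-empty. It aligns tokens by character-span overlap, averages the teacher's per-token loss over the teacher tokens that overlap each student token, and then keeps the student tokens with the largest excess loss. The function names, the mean aggregation, and the `keep_ratio` selection rule are all hypothetical choices made for illustration.

```python
from typing import List, Tuple

Span = Tuple[int, int]  # half-open character range [start, end)


def char_spans(tokens: List[str]) -> List[Span]:
    """Character span of each token, assuming tokens concatenate to the text."""
    spans, pos = [], 0
    for tok in tokens:
        spans.append((pos, pos + len(tok)))
        pos += len(tok)
    return spans


def align_tokens(student: List[str], teacher: List[str]) -> List[List[int]]:
    """For each student token, list the teacher token indices that overlap it."""
    if "".join(student) != "".join(teacher):
        raise ValueError("both tokenizations must cover the same text")
    s_spans, t_spans = char_spans(student), char_spans(teacher)
    mapping, j = [], 0
    for s_start, s_end in s_spans:
        # Skip teacher tokens that end before this student token starts.
        while j < len(t_spans) and t_spans[j][1] <= s_start:
            j += 1
        # Collect every teacher token that starts before this student token ends.
        k, overlap = j, []
        while k < len(t_spans) and t_spans[k][0] < s_end:
            overlap.append(k)
            k += 1
        mapping.append(overlap)
    return mapping


def mapped_teacher_loss(teacher_losses: List[float],
                        mapping: List[List[int]]) -> List[float]:
    """Mean teacher loss over the teacher tokens aligned to each student token."""
    return [sum(teacher_losses[i] for i in idxs) / len(idxs) for idxs in mapping]


def guided_student_loss(student_losses: List[float],
                        teacher_losses: List[float],
                        mapping: List[List[int]],
                        keep_ratio: float = 0.5) -> float:
    """Average student loss over the tokens with the largest excess loss
    (student loss minus mapped teacher loss). Hypothetical selection rule."""
    reference = mapped_teacher_loss(teacher_losses, mapping)
    excess = [s - r for s, r in zip(student_losses, reference)]
    n_keep = max(1, int(len(excess) * keep_ratio))
    keep = sorted(range(len(excess)), key=lambda i: excess[i], reverse=True)[:n_keep]
    return sum(student_losses[i] for i in keep) / n_keep


# Two hypothetical tokenizations of the same text "vocabulary mismatch".
student_tokens = ["vocab", "ulary", " mis", "match"]
teacher_tokens = ["voc", "abulary", " mismatch"]
mapping = align_tokens(student_tokens, teacher_tokens)
```

With real tokenizers, the character spans would normally come from the tokenizer's offset mapping rather than from token string lengths, since byte-level and SentencePiece vocabularies do not always round-trip to the raw text.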
