Scaling LLM Pre-training with Vocabulary Curriculum

February 25, 2025
Author: Fangyuan Yu
cs.AI

Abstract

Modern language models rely on static vocabularies, fixed before pretraining, in contrast to the adaptive vocabulary acquisition observed in human language learning. To bridge this gap, we introduce vocabulary curriculum learning, an approach that improves pretraining efficiency with log-linear scaling gains relative to vocabulary size. Our method alternates between entropy-guided vocabulary expansion and model optimization, enabling models to learn transferable representations across diverse tokenization granularities. This approach naturally gives rise to an optimal computation allocation pattern: longer tokens capture predictable content, while shorter tokens focus on more complex, harder-to-predict contexts. Experiments on small-scale GPT models demonstrate improved scaling efficiency, reinforcing the effectiveness of dynamic tokenization. We release our code to support further research and plan to extend our experiments to larger models and diverse domains.
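The abstract does not spell out the expansion procedure, but the alternation it describes can be illustrated with a small sketch. In the toy below, a bigram count model stands in for the trained language model, and the helper names, the pairwise merge rule, and the 0.5-bit entropy threshold (`continuation_entropy`, `expand_vocabulary`) are all assumptions for illustration, not the paper's implementation. Adjacent tokens are merged wherever the next-token distribution is low-entropy, so predictable spans end up as longer tokens while harder-to-predict spans stay finely tokenized:

```python
import math
from collections import Counter, defaultdict


def continuation_entropy(tokens):
    """Entropy (bits) of the empirical next-token distribution after each
    token; a stand-in for the trained model's predictive entropy."""
    following = defaultdict(Counter)
    for a, b in zip(tokens, tokens[1:]):
        following[a][b] += 1
    entropy = {}
    for tok, counts in following.items():
        total = sum(counts.values())
        entropy[tok] = -sum(c / total * math.log2(c / total) for c in counts.values())
    return entropy


def expand_vocabulary(tokens, entropy, threshold=0.5):
    """Merge each token with its successor wherever continuation entropy is
    below the threshold (hypothetical merge rule): predictable content gets
    longer tokens, harder-to-predict content keeps shorter ones."""
    merged, i = [], 0
    while i < len(tokens):
        if i + 1 < len(tokens) and entropy.get(tokens[i], float("inf")) < threshold:
            merged.append(tokens[i] + tokens[i + 1])
            i += 2
        else:
            merged.append(tokens[i])
            i += 1
    return merged


corpus = "the cat sat on the mat and the cat sat on the hat"
tokens = list(corpus)  # start from character-level tokens
for step in range(3):  # alternate expansion with (here, implicit) re-estimation
    tokens = expand_vocabulary(tokens, continuation_entropy(tokens))
    print(f"round {step}: {len(set(tokens))} distinct tokens, sequence length {len(tokens)}")
```

Each round grows the vocabulary where prediction is easy, mirroring the computation-allocation pattern described above; in the paper's method the model itself is re-optimized between expansion rounds, rather than the simple count statistics used here.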
