Over-Tokenized Transformer: Vocabulary is Generally Worth Scaling

January 28, 2025
作者: Hongzhi Huang, Defa Zhu, Banggu Wu, Yutao Zeng, Ya Wang, Qiyang Min, Xun Zhou
cs.AI

Abstract

Tokenization is a fundamental component of large language models (LLMs), yet its influence on model scaling and performance is not fully explored. In this paper, we introduce Over-Tokenized Transformers, a novel framework that decouples input and output vocabularies to improve language modeling performance. Specifically, our approach scales up input vocabularies to leverage multi-gram tokens. Through extensive experiments, we uncover a log-linear relationship between input vocabulary size and training loss, demonstrating that larger input vocabularies consistently enhance model performance, regardless of model size. Using a large input vocabulary, we achieve performance comparable to double-sized baselines with no additional cost. Our findings highlight the importance of tokenization in scaling laws and provide practical insight for tokenizer design, paving the way for more efficient and powerful LLMs.
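The abstract describes decoupling a large input vocabulary, built from multi-gram tokens, from a standard-size output vocabulary. Below is a minimal PyTorch sketch of that idea, assuming a simple rolling-hash mapping of n-grams into an enlarged input embedding table; the class name `OverTokenizedEmbedding`, the hashing scheme, and all sizes are illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn


class OverTokenizedEmbedding(nn.Module):
    """Input-side embedding that sums 1-gram through max_n-gram lookups,
    with n-grams hashed into a large input table decoupled from the output vocabulary."""

    def __init__(self, base_vocab: int, input_vocab: int, d_model: int, max_n: int = 3):
        super().__init__()
        self.base_vocab = base_vocab    # tokenizer / output vocabulary size
        self.input_vocab = input_vocab  # much larger table used only on the input side
        self.max_n = max_n
        self.table = nn.Embedding(input_vocab, d_model)

    def forward(self, ids: torch.Tensor) -> torch.Tensor:
        # ids: (batch, seq) token ids from the ordinary tokenizer
        emb = self.table(ids)           # 1-gram embeddings
        ngram = ids
        for n in range(2, self.max_n + 1):
            prev = torch.roll(ids, shifts=n - 1, dims=1)
            prev[:, : n - 1] = 0        # pad positions before the sequence start
            # Rolling hash of the n-gram, folded into the large input table
            # (an assumed scheme, chosen only to keep the sketch self-contained).
            ngram = (ngram * self.base_vocab + prev) % self.input_vocab
            emb = emb + self.table(ngram)
        return emb


# Toy sizes for illustration; the paper scales the input table much further.
base_vocab, input_vocab, d_model = 32_000, 200_000, 64
embed = OverTokenizedEmbedding(base_vocab, input_vocab, d_model)
lm_head = nn.Linear(d_model, base_vocab, bias=False)  # output vocabulary unchanged

ids = torch.randint(0, base_vocab, (2, 16))
hidden = embed(ids)         # (2, 16, 64) -- would feed the Transformer blocks
logits = lm_head(hidden)    # (2, 16, 32000)
```

The usage lines show the decoupling: only the input embedding table grows with the enlarged vocabulary, while the output projection keeps the original tokenizer vocabulary, so decoding cost is unchanged.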
