
Scaling Laws with Vocabulary: Larger Models Deserve Larger Vocabularies

July 18, 2024
Authors: Chaofan Tao, Qian Liu, Longxu Dou, Niklas Muennighoff, Zhongwei Wan, Ping Luo, Min Lin, Ngai Wong
cs.AI

Abstract

Research on scaling large language models (LLMs) has primarily focused on model parameters and training data size, overlooking the role of vocabulary size. Intuitively, larger vocabularies enable more efficient tokenization by representing sentences with fewer tokens, but they also increase the risk of under-fitting representations for rare tokens. We investigate how vocabulary size impacts LLM scaling laws by training models ranging from 33M to 3B parameters on up to 500B characters with various vocabulary configurations. We propose three complementary approaches for predicting the compute-optimal vocabulary size: IsoFLOPs analysis, derivative estimation, and parametric fit of the loss function. Our approaches converge on the same result: the optimal vocabulary size depends on the available compute budget, and larger models deserve larger vocabularies. However, most LLMs use vocabulary sizes that are too small. For example, we predict that the optimal vocabulary size of Llama2-70B should have been at least 216K, 7 times larger than its vocabulary of 32K. We validate our predictions empirically by training models with 3B parameters across different FLOPs budgets. Adopting our predicted optimal vocabulary size consistently improves downstream performance over commonly used vocabulary sizes. By increasing the vocabulary size from the conventional 32K to 43K, we improve performance on ARC-Challenge from 29.1 to 32.0 with the same 2.3e21 FLOPs. Our work emphasizes the necessity of jointly considering model parameters and vocabulary size for efficient scaling.
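To make the compute trade-off concrete, below is a minimal sketch, not the paper's method, of how vocabulary size interacts with a fixed FLOPs budget. It assumes the common ~6·N·D training-FLOPs approximation, a hypothetical embedding width D_MODEL, a hypothetical non-vocabulary parameter count N_NON_VOCAB, and an illustrative logarithmic chars_per_token curve for tokenization efficiency; none of these constants or functional forms come from the paper. The sketch only shows the tension the abstract describes: a larger vocabulary adds embedding and output-head parameters (so fewer tokens fit in the budget) but packs more characters into each token. Locating the actual optimum is what the paper's IsoFLOPs analysis, derivative estimation, and parametric loss fit do.

```python
# Illustrative sketch only: all constants and functional forms are assumptions,
# not the paper's fitted scaling laws.
import math

FLOPS_BUDGET = 2.3e21   # budget mentioned in the abstract's 3B experiment
D_MODEL = 3200          # hypothetical embedding width
N_NON_VOCAB = 3.0e9     # hypothetical non-vocabulary parameter count


def chars_per_token(vocab_size: int) -> float:
    """Assumed tokenization efficiency: larger vocabularies pack more
    characters into each token, with diminishing returns."""
    return 1.0 + 0.5 * math.log(vocab_size / 1000.0)


def trainable_characters(vocab_size: int) -> float:
    """Characters of text trainable under FLOPS_BUDGET for a given vocabulary,
    using the ~6*N*D FLOPs approximation and counting the input embedding plus
    output head as 2*V*d extra parameters."""
    num_params = N_NON_VOCAB + 2 * vocab_size * D_MODEL
    num_tokens = FLOPS_BUDGET / (6.0 * num_params)
    return num_tokens * chars_per_token(vocab_size)


for v in (32_000, 43_000, 64_000, 128_000, 216_000):
    print(f"V={v:>7,}  trainable characters ~ {trainable_characters(v):.3e}")
```

Running the sweep shows that neither effect dominates by default under these toy assumptions; the compute-optimal vocabulary depends on how the loss responds to both, which is exactly what the paper fits empirically.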
