Scaling Laws with Vocabulary: Larger Models Deserve Larger Vocabularies

July 18, 2024
Authors: Chaofan Tao, Qian Liu, Longxu Dou, Niklas Muennighoff, Zhongwei Wan, Ping Luo, Min Lin, Ngai Wong
cs.AI

Abstract

Research on scaling large language models (LLMs) has primarily focused on model parameters and training data size, overlooking the role of vocabulary size. Intuitively, larger vocabularies enable more efficient tokenization by representing sentences with fewer tokens, but they also increase the risk of under-fitting representations for rare tokens. We investigate how vocabulary size impacts LLM scaling laws by training models ranging from 33M to 3B parameters on up to 500B characters with various vocabulary configurations. We propose three complementary approaches for predicting the compute-optimal vocabulary size: IsoFLOPs analysis, derivative estimation, and parametric fit of the loss function. Our approaches converge on the same result: the optimal vocabulary size depends on the available compute budget, and larger models deserve larger vocabularies. However, most LLMs use vocabulary sizes that are too small. For example, we predict that the optimal vocabulary size of Llama2-70B should have been at least 216K, 7 times larger than its vocabulary of 32K. We validate our predictions empirically by training models with 3B parameters across different FLOPs budgets. Adopting our predicted optimal vocabulary size consistently improves downstream performance over commonly used vocabulary sizes. By increasing the vocabulary size from the conventional 32K to 43K, we improve performance on ARC-Challenge from 29.1 to 32.0 with the same 2.3e21 FLOPs. Our work emphasizes the necessity of jointly considering model parameters and vocabulary size for efficient scaling.
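
To make the trade-off concrete, below is a minimal sketch of the bookkeeping behind an IsoFLOPs-style comparison: at a fixed compute budget, a larger vocabulary adds embedding parameters (fewer trainable tokens per budget) but compresses text so that each token covers more raw characters. This is not the paper's code; the non-embedding parameter count, d_model, and the characters-per-token compression ratios are hypothetical placeholders, and it uses the common C ≈ 6·N·D FLOPs approximation rather than the paper's exact accounting.

```python
"""Illustrative sketch of the vocabulary trade-off under a fixed FLOPs budget.
All constants below are hypothetical placeholders, not values from the paper."""


def total_params(non_embed_params: float, vocab_size: int, d_model: int) -> float:
    """Total parameters = non-embedding parameters + input and output embeddings."""
    return non_embed_params + 2 * vocab_size * d_model


def tokens_for_budget(flops_budget: float, n_params: float) -> float:
    """Training tokens affordable under the common C ~= 6 * N * D approximation."""
    return flops_budget / (6.0 * n_params)


def chars_covered(tokens: float, chars_per_token: float) -> float:
    """Raw characters of training text seen, given the tokenizer's compression rate."""
    return tokens * chars_per_token


# Hypothetical setup loosely matching the abstract's 3B-parameter, 2.3e21 FLOPs example.
FLOPS_BUDGET = 2.3e21
NON_EMBED_PARAMS = 2.8e9   # assumed non-embedding parameter count
D_MODEL = 3072             # assumed hidden size

for vocab_size, chars_per_token in [(32_000, 3.8), (43_000, 4.0)]:  # assumed compression rates
    n = total_params(NON_EMBED_PARAMS, vocab_size, D_MODEL)
    d = tokens_for_budget(FLOPS_BUDGET, n)
    print(f"V={vocab_size:>6}: params={n:.3e}, tokens={d:.3e}, "
          f"chars={chars_covered(d, chars_per_token):.3e}")
```

Under these assumed numbers, the 43K vocabulary spends slightly more of the budget on embedding parameters and therefore trains on fewer tokens, yet covers more raw characters because it compresses text better; the paper's scaling-law analyses quantify where this trade-off is optimal for a given compute budget.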
