Scaling Laws with Vocabulary: Larger Models Deserve Larger Vocabularies

Abstract

Research on scaling large language models (LLMs) has primarily focused on model parameters and training data size, overlooking the role of vocabulary size. Intuitively, larger vocabularies enable more efficient tokenization by representing sentences with fewer tokens, but they also increase the risk of under-fitting representations for rare tokens. We investigate how vocabulary size impacts LLM scaling laws by training models ranging from 33M to 3B parameters on up to 500B characters with various vocabulary configurations. We propose three complementary approaches for predicting the compute-optimal vocabulary size: IsoFLOPs analysis, derivative estimation, and a parametric fit of the loss function. All three approaches converge on the same result: the optimal vocabulary size depends on the available compute budget, and larger models deserve larger vocabularies. However, most LLMs use vocabularies that are too small. For example, we predict that the optimal vocabulary size of Llama2-70B should have been at least 216K, 7 times larger than its vocabulary of 32K. We validate our predictions empirically by training models with 3B parameters across different FLOPs budgets. Adopting our predicted optimal vocabulary size consistently improves downstream performance over commonly used vocabulary sizes. By increasing the vocabulary size from the conventional 32K to 43K, we improve performance on ARC-Challenge from 29.1 to 32.0 with the same 2.3e21 FLOPs. Our work emphasizes the necessity of jointly considering model parameters and vocabulary size for efficient scaling.
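As a quick sanity check on the figures quoted above, the following minimal Python sketch recomputes the vocabulary ratio for Llama2-70B and the ARC-Challenge gain. The constant and function names are illustrative and do not come from the paper, and "K" is assumed here to mean 1,000.

```python
# Figures as quoted in the abstract; "K" is assumed to mean 1,000.
ACTUAL_VOCAB = 32_000       # Llama2-70B vocabulary ("32K")
PREDICTED_VOCAB = 216_000   # predicted lower bound on the optimum ("216K")

ARC_SCORE_32K = 29.1        # ARC-Challenge score with the 32K vocabulary
ARC_SCORE_43K = 32.0        # ARC-Challenge score with the 43K vocabulary


def vocab_ratio(predicted: int, actual: int) -> float:
    """Return how many times larger the predicted vocabulary is."""
    return predicted / actual


def absolute_gain(before: float, after: float) -> float:
    """Return the score difference, rounded to one decimal place."""
    return round(after - before, 1)


ratio = vocab_ratio(PREDICTED_VOCAB, ACTUAL_VOCAB)
gain = absolute_gain(ARC_SCORE_32K, ARC_SCORE_43K)

print(f"predicted / actual vocabulary: {ratio:.2f}x")
print(f"ARC-Challenge gain: {gain:.1f} points")
```

The ratio comes out slightly below 7, which matches the abstract's rounded "7 times larger".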
