Better & Faster Large Language Models via Multi-token Prediction

April 30, 2024
Authors: Fabian Gloeckle, Badr Youbi Idrissi, Baptiste Rozière, David Lopez-Paz, Gabriel Synnaeve
cs.AI

Abstract

Large language models such as GPT and Llama are trained with a next-token prediction loss. In this work, we suggest that training language models to predict multiple future tokens at once results in higher sample efficiency. More specifically, at each position in the training corpus, we ask the model to predict the following n tokens using n independent output heads, operating on top of a shared model trunk. Considering multi-token prediction as an auxiliary training task, we measure improved downstream capabilities with no overhead in training time for both code and natural language models. The method is increasingly useful for larger model sizes, and keeps its appeal when training for multiple epochs. Gains are especially pronounced on generative benchmarks like coding, where our models consistently outperform strong baselines by several percentage points. Our 13B-parameter model solves 12% more problems on HumanEval and 17% more on MBPP than comparable next-token models. Experiments on small algorithmic tasks demonstrate that multi-token prediction is favorable for the development of induction heads and algorithmic reasoning capabilities. As an additional benefit, models trained with 4-token prediction are up to 3 times faster at inference, even with large batch sizes.
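
The sketch below is a minimal, illustrative take on the setup described in the abstract, not the authors' implementation: a shared trunk produces one hidden state per position, and n independent output heads each predict one of the next n tokens. The toy transformer trunk, all dimensions, and the averaging of the per-head cross-entropy losses are assumptions made only to keep the example compact.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class MultiTokenPredictor(nn.Module):
    """Shared trunk with n independent output heads (illustrative only)."""

    def __init__(self, vocab_size: int, d_model: int, n_future: int = 4):
        super().__init__()
        self.n_future = n_future  # number of future tokens predicted per position
        self.embed = nn.Embedding(vocab_size, d_model)
        # Stand-in trunk; the paper's models use a full transformer decoder here.
        self.trunk = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model, nhead=8, batch_first=True),
            num_layers=2,
        )
        # n independent heads operating on top of the shared trunk representation.
        self.heads = nn.ModuleList(
            [nn.Linear(d_model, vocab_size) for _ in range(n_future)]
        )

    def forward(self, tokens: torch.Tensor) -> torch.Tensor:
        seq_len = tokens.size(1)
        causal_mask = nn.Transformer.generate_square_subsequent_mask(seq_len).to(tokens.device)
        h = self.trunk(self.embed(tokens), mask=causal_mask)  # (batch, seq, d_model)
        # Stack head outputs: head k predicts the token at offset k + 1.
        return torch.stack([head(h) for head in self.heads])  # (n_future, batch, seq, vocab)


def multi_token_loss(model: MultiTokenPredictor, tokens: torch.Tensor) -> torch.Tensor:
    """Cross-entropy averaged over the n future-token heads (an assumed reduction)."""
    logits = model(tokens)
    losses = []
    for k in range(model.n_future):
        targets = tokens[:, k + 1:]              # token at position t + k + 1
        preds = logits[k][:, : targets.size(1)]  # drop trailing positions with no target
        losses.append(
            F.cross_entropy(preds.reshape(-1, preds.size(-1)), targets.reshape(-1))
        )
    return torch.stack(losses).mean()


if __name__ == "__main__":
    model = MultiTokenPredictor(vocab_size=1000, d_model=128, n_future=4)
    batch = torch.randint(0, 1000, (2, 32))  # toy batch of token ids
    print(multi_token_loss(model, batch).item())
```

At inference time one can simply drop the extra heads and decode with the next-token head alone; the reported 3x speedup comes from using the additional heads to propose several tokens at once, which the sketch above does not cover.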
