

Better & Faster Large Language Models via Multi-token Prediction

April 30, 2024
作者: Fabian Gloeckle, Badr Youbi Idrissi, Baptiste Rozière, David Lopez-Paz, Gabriel Synnaeve
cs.AI

Abstract

Large language models such as GPT and Llama are trained with a next-token prediction loss. In this work, we suggest that training language models to predict multiple future tokens at once results in higher sample efficiency. More specifically, at each position in the training corpus, we ask the model to predict the following n tokens using n independent output heads, operating on top of a shared model trunk. Considering multi-token prediction as an auxiliary training task, we measure improved downstream capabilities with no overhead in training time for both code and natural language models. The method is increasingly useful for larger model sizes, and keeps its appeal when training for multiple epochs. Gains are especially pronounced on generative benchmarks like coding, where our models consistently outperform strong baselines by several percentage points. Our 13B-parameter model solves 12% more problems on HumanEval and 17% more on MBPP than comparable next-token models. Experiments on small algorithmic tasks demonstrate that multi-token prediction is favorable for the development of induction heads and algorithmic reasoning capabilities. As an additional benefit, models trained with 4-token prediction are up to 3 times faster at inference, even with large batch sizes.
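To make the setup concrete, the snippet below is a minimal PyTorch-style sketch (not the authors' code) of the core idea: n independent output heads operating on top of a shared trunk, each trained with a cross-entropy loss against the token that many steps ahead. The class and function names, the single-layer trunk, and the toy hyperparameters are illustrative assumptions; the paper's models use a full causal transformer decoder as the trunk.

```python
# Minimal sketch of multi-token prediction with n independent output heads
# on a shared trunk. Names and the one-layer trunk are illustrative only.
import torch
import torch.nn as nn
import torch.nn.functional as F


class MultiTokenPredictionModel(nn.Module):
    """Shared trunk followed by n independent unembedding heads."""

    def __init__(self, vocab_size: int, d_model: int, n_future: int = 4):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)
        # Stand-in for the shared trunk; the real model is a full decoder stack.
        self.trunk = nn.TransformerEncoderLayer(
            d_model=d_model, nhead=8, batch_first=True
        )
        # One output head per future offset (head 0 predicts the next token).
        self.heads = nn.ModuleList(
            [nn.Linear(d_model, vocab_size) for _ in range(n_future)]
        )
        self.n_future = n_future

    def forward(self, tokens: torch.Tensor) -> torch.Tensor:
        # Causal mask so each position only attends to the past.
        mask = nn.Transformer.generate_square_subsequent_mask(
            tokens.size(1)
        ).to(tokens.device)
        h = self.trunk(self.embed(tokens), src_mask=mask)   # (B, T, d_model)
        # Stack per-head logits: (n_future, B, T, vocab_size)
        return torch.stack([head(h) for head in self.heads])


def multi_token_loss(logits: torch.Tensor, tokens: torch.Tensor) -> torch.Tensor:
    """Sum of cross-entropy losses; head i is supervised by the token i+1 ahead."""
    n_future, _, seq_len, vocab = logits.shape
    loss = torch.zeros((), device=logits.device)
    for i in range(n_future):
        valid = seq_len - (i + 1)               # positions with a target i+1 ahead
        pred = logits[i, :, :valid, :].reshape(-1, vocab)
        target = tokens[:, i + 1 : i + 1 + valid].reshape(-1)
        loss = loss + F.cross_entropy(pred, target)
    return loss


# Toy usage: batch of 2 sequences, 16 tokens each, small vocabulary.
model = MultiTokenPredictionModel(vocab_size=100, d_model=64, n_future=4)
tokens = torch.randint(0, 100, (2, 16))
loss = multi_token_loss(model(tokens), tokens)
loss.backward()
```

At inference time the extra heads can simply be dropped, keeping only the standard next-token head, or used to propose several tokens per forward pass in a speculative-style decoding scheme, which is how the up-to-3x inference speedup mentioned above is obtained.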
