Algorithmic progress in language models
March 9, 2024
Authors: Anson Ho, Tamay Besiroglu, Ege Erdil, David Owen, Robi Rahman, Zifan Carl Guo, David Atkinson, Neil Thompson, Jaime Sevilla
cs.AI
Abstract
We investigate the rate at which algorithms for pre-training language models
have improved since the advent of deep learning. Using a dataset of over 200
language model evaluations on Wikitext and Penn Treebank spanning 2012-2023, we
find that the compute required to reach a set performance threshold has halved
approximately every 8 months, with a 95% confidence interval of around 5 to 14
months, substantially faster than hardware gains per Moore's Law. We estimate
augmented scaling laws, which enable us to quantify algorithmic progress and
determine the relative contributions of scaling models versus innovations in
training algorithms. Despite the rapid pace of algorithmic progress and the
development of new architectures such as the transformer, our analysis reveals
that the increase in compute made an even larger contribution to overall
performance improvements over this time period. Though limited by noisy
benchmark data, our analysis quantifies the rapid progress in language
modeling, shedding light on the relative contributions from compute and
algorithms.
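The headline figure — compute requirements halving roughly every 8 months — implies an exponential "effective compute" multiplier over the study window. A minimal sketch of that arithmetic (illustrative only; the function name and the fixed 8-month parameter are assumptions for this example, not the paper's estimation code):

```python
def algorithmic_gain(years, halving_months=8.0):
    """Factor by which the compute needed to hit a fixed performance
    threshold shrinks after `years`, given a constant halving time.

    Equivalent to 2 raised to the number of halving periods elapsed.
    """
    return 2 ** (12.0 * years / halving_months)


# After exactly one halving period (8 months), the gain factor is 2x.
print(algorithmic_gain(8.0 / 12.0))

# Over the 2012-2023 window (~11 years), the implied reduction in
# required compute is on the order of tens of thousands; the paper's
# 95% CI of 5-14 months makes this figure highly uncertain.
print(algorithmic_gain(11.0))
```

The same formula with a 5- or 14-month halving time reproduces the wide range the confidence interval implies, which is why the paper reports the interval rather than a point estimate alone.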