The Languini Kitchen: Enabling Language Modelling Research at Different Scales of Compute
September 20, 2023
Authors: Aleksandar Stanić, Dylan Ashley, Oleg Serikov, Louis Kirsch, Francesco Faccio, Jürgen Schmidhuber, Thomas Hofmann, Imanol Schlag
cs.AI
Abstract
The Languini Kitchen serves as both a research collective and codebase
designed to empower researchers with limited computational resources to
contribute meaningfully to the field of language modelling. We introduce an
experimental protocol that enables model comparisons based on equivalent
compute, measured in accelerator hours. The number of tokens on which a model
is trained is defined by the model's throughput and the chosen compute class.
Notably, this approach avoids constraints on critical hyperparameters which
affect total parameters or floating-point operations. For evaluation, we
pre-process an existing large, diverse, and high-quality dataset of books that
surpasses existing academic benchmarks in quality, diversity, and document
length. On it, we compare methods based on their empirical scaling trends which
are estimated through experiments at various levels of compute. This work also
provides two baseline models: a feed-forward model derived from the GPT-2
architecture and a recurrent model in the form of a novel LSTM with ten-fold
throughput. While the GPT baseline achieves better perplexity throughout all
our levels of compute, our LSTM baseline exhibits a predictable and more
favourable scaling law. This is due to the improved throughput and the need for
fewer training tokens to achieve the same decrease in test perplexity.
Extrapolating the scaling laws of both models results in an intersection
at roughly 50,000 accelerator hours. We hope this work can serve as the
foundation for meaningful and reproducible language modelling research.
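To make the compute-matched protocol concrete, the following minimal sketch shows how a token budget could be derived from a model's measured throughput and a chosen compute class, as described in the abstract. It is an illustrative assumption of the bookkeeping only, not the Languini Kitchen API; the function name, the example throughput, and the compute-class values are hypothetical.

```python
# Minimal sketch (not the Languini Kitchen API): derive the number of training
# tokens from a model's measured throughput and a chosen compute class.

def tokens_for_compute_class(tokens_per_second: float, accelerator_hours: float) -> int:
    """Return how many tokens a model can be trained on within a fixed compute budget.

    tokens_per_second : empirically measured training throughput of the model
                        on the reference accelerator (hypothetical value below).
    accelerator_hours : the chosen compute class, e.g. 6 or 12 accelerator hours
                        (illustrative values, not prescribed by the paper).
    """
    return int(tokens_per_second * accelerator_hours * 3600)


if __name__ == "__main__":
    # Example: a model processing 20,000 tokens/s, trained in a 6-hour compute class.
    print(tokens_for_compute_class(20_000, 6))  # -> 432,000,000 tokens
```

Under this accounting, a faster model (such as the LSTM baseline with higher throughput) sees more tokens within the same compute class, which is why throughput enters the comparison directly rather than through parameter counts or FLOPs.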