

The Languini Kitchen: Enabling Language Modelling Research at Different Scales of Compute

September 20, 2023
Authors: Aleksandar Stanić, Dylan Ashley, Oleg Serikov, Louis Kirsch, Francesco Faccio, Jürgen Schmidhuber, Thomas Hofmann, Imanol Schlag
cs.AI

Abstract

The Languini Kitchen serves as both a research collective and codebase designed to empower researchers with limited computational resources to contribute meaningfully to the field of language modelling. We introduce an experimental protocol that enables model comparisons based on equivalent compute, measured in accelerator hours. The number of tokens on which a model is trained is defined by the model's throughput and the chosen compute class. Notably, this approach avoids constraints on critical hyperparameters which affect total parameters or floating-point operations. For evaluation, we pre-process an existing large, diverse, and high-quality dataset of books that surpasses existing academic benchmarks in quality, diversity, and document length. On it, we compare methods based on their empirical scaling trends which are estimated through experiments at various levels of compute. This work also provides two baseline models: a feed-forward model derived from the GPT-2 architecture and a recurrent model in the form of a novel LSTM with ten-fold throughput. While the GPT baseline achieves better perplexity throughout all our levels of compute, our LSTM baseline exhibits a predictable and more favourable scaling law. This is due to the improved throughput and the need for fewer training tokens to achieve the same decrease in test perplexity. Extrapolating the scaling laws of both models results in an intersection at roughly 50,000 accelerator hours. We hope this work can serve as the foundation for meaningful and reproducible language modelling research.
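
As a rough illustration of the protocol described in the abstract, the sketch below shows how a fixed compute class (in accelerator hours) together with a model's measured throughput could determine its training-token budget, and how two empirical scaling trends might be extrapolated to the compute level at which they intersect. This is not the Languini codebase: the function names, throughput figures, and perplexity values are hypothetical placeholders under the assumption that the scaling trends are well described by power laws.

```python
# Minimal sketch (not the official Languini code): turning a compute class and a
# model's measured throughput into a token budget, and extrapolating two fitted
# scaling laws to their intersection. All numbers below are illustrative.

import numpy as np


def token_budget(tokens_per_second: float, accelerator_hours: float) -> int:
    """Tokens a model may consume within a fixed compute class.

    The budget depends only on measured throughput and the chosen compute
    class (accelerator hours), not on parameter count or FLOPs.
    """
    return int(tokens_per_second * accelerator_hours * 3600)


def fit_power_law(hours: np.ndarray, perplexity: np.ndarray):
    """Least-squares fit of log(ppl) = a + b * log(hours); returns (a, b)."""
    b, a = np.polyfit(np.log(hours), np.log(perplexity), deg=1)
    return a, b


def intersection_hours(fit1, fit2) -> float:
    """Compute class (in hours) where the two fitted laws predict equal perplexity."""
    a1, b1 = fit1
    a2, b2 = fit2
    return float(np.exp((a2 - a1) / (b1 - b2)))


if __name__ == "__main__":
    # Hypothetical compute classes and test perplexities (made-up numbers).
    hours = np.array([6.0, 12.0, 24.0, 48.0, 96.0])
    ppl_gpt = np.array([30.0, 27.0, 24.5, 22.3, 20.4])
    ppl_lstm = np.array([34.0, 30.0, 26.6, 23.7, 21.2])

    # A higher-throughput model sees more tokens within the same compute class.
    print("GPT tokens at 24h: ", token_budget(tokens_per_second=20_000, accelerator_hours=24))
    print("LSTM tokens at 24h:", token_budget(tokens_per_second=200_000, accelerator_hours=24))

    gpt_fit = fit_power_law(hours, ppl_gpt)
    lstm_fit = fit_power_law(hours, ppl_lstm)
    print("Fitted laws intersect at ~%.0f accelerator hours" % intersection_hours(gpt_fit, lstm_fit))
```

With real measurements in place of the placeholder perplexities, the same extrapolation is what yields the paper's reported crossover at roughly 50,000 accelerator hours.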