The Languini Kitchen: Enabling Language Modelling Research at Different Scales of Compute

Abstract

The Languini Kitchen serves as both a research collective and codebase designed to empower researchers with limited computational resources to contribute meaningfully to the field of language modelling. We introduce an experimental protocol that enables model comparisons based on equivalent compute, measured in accelerator hours. The number of tokens on which a model is trained is defined by the model's throughput and the chosen compute class. Notably, this approach avoids constraints on critical hyperparameters which affect total parameters or floating-point operations. For evaluation, we pre-process an existing large, diverse, and high-quality dataset of books that surpasses existing academic benchmarks in quality, diversity, and document length. On it, we compare methods based on their empirical scaling trends, which are estimated through experiments at various levels of compute. This work also provides two baseline models: a feed-forward model derived from the GPT-2 architecture and a recurrent model in the form of a novel LSTM with ten-fold throughput. While the GPT baseline achieves better perplexity at all of our levels of compute, our LSTM baseline exhibits a predictable and more favourable scaling law. This is due to the improved throughput and the need for fewer training tokens to achieve the same decrease in test perplexity. Extrapolating the scaling laws of both models results in an intersection at roughly 50,000 accelerator hours. We hope this work can serve as the foundation for meaningful and reproducible language modelling research.
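As a rough illustration of the protocol described in the abstract, the sketch below computes the token budget implied by a model's throughput and a compute class, and the point where two extrapolated scaling curves meet. The function names, the throughput figures, and the power-law form perplexity = a * hours^(-b) are all assumptions made for this example; they are not taken from the Languini codebase or from its reported results.

```python
def token_budget(tokens_per_second: float, accelerator_hours: float) -> int:
    """Number of training tokens that fit into a fixed accelerator-hour budget.

    Both arguments are hypothetical inputs: the throughput would be measured
    on the reference hardware, and the hours come from the chosen compute class.
    """
    return int(tokens_per_second * accelerator_hours * 3600)


def crossover_hours(a1: float, b1: float, a2: float, b2: float) -> float:
    """Accelerator hours h at which a1 * h**(-b1) equals a2 * h**(-b2).

    Assumes each model's perplexity follows a power law in compute, which is
    an assumption of this sketch rather than something the abstract states.
    Requires b1 != b2, otherwise the curves never cross (or coincide).
    """
    if b1 == b2:
        raise ValueError("curves with equal exponents have no single crossover")
    return (a2 / a1) ** (1.0 / (b2 - b1))


# Placeholder throughputs: a model with ten times the throughput sees ten
# times as many tokens within the same compute class.
slow_tokens = token_budget(tokens_per_second=10_000, accelerator_hours=6)
fast_tokens = token_budget(tokens_per_second=100_000, accelerator_hours=6)

# Placeholder coefficients, chosen only so the arithmetic is easy to check.
hours = crossover_hours(a1=10.0, b1=0.1, a2=20.0, b2=0.2)
```

Because the token budget is derived from throughput rather than fixed in advance, a faster model is rewarded with more training data under the same compute class, which is what lets the comparison avoid constraining parameter count or floating-point operations directly.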
