Parameters vs FLOPs: Scaling Laws for Optimal Sparsity for Mixture-of-Experts Language Models

Abstract

Scaling the capacity of language models has consistently proven to be a reliable approach for improving performance and unlocking new capabilities. Capacity can be primarily defined by two dimensions: the number of model parameters and the compute per example. While scaling typically involves increasing both, the precise interplay between these factors and their combined contribution to overall capacity is not yet fully understood. We explore this relationship in the context of sparse Mixture-of-Experts (MoEs), which allow scaling the number of parameters without proportionally increasing the FLOPs per example. We investigate how varying the sparsity level, i.e., the fraction of inactive parameters, affects model performance during pretraining and downstream few-shot evaluation. We find that under different constraints (e.g., parameter size and total training compute), there is an optimal level of sparsity that improves both training efficiency and model performance. These results provide a better understanding of the impact of sparsity in scaling laws for MoEs and complement existing works in this area, offering insights for designing more efficient architectures.
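To make the two capacity dimensions concrete, the following is a minimal sketch of how total parameters, active parameters, and sparsity relate in a top-k routed MoE. It is an illustration under stated assumptions, not the accounting used in the paper: the class `MoEConfig` and all of its field names are hypothetical, every expert is assumed to be the same size, and forward FLOPs per token are approximated by the common rule of thumb of 2 FLOPs per active parameter.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MoEConfig:
    """Hypothetical parameter accounting for a top-k routed MoE model."""

    shared_params: int      # parameters used by every token (e.g. attention, embeddings)
    params_per_expert: int  # parameters in one expert (all experts assumed equal)
    num_experts: int        # total number of experts
    top_k: int              # experts activated per token

    @property
    def total_params(self) -> int:
        # Every expert counts toward model size.
        return self.shared_params + self.num_experts * self.params_per_expert

    @property
    def active_params(self) -> int:
        # Only the top_k routed experts are used for a given token.
        return self.shared_params + self.top_k * self.params_per_expert

    @property
    def sparsity(self) -> float:
        # Fraction of inactive parameters, following the abstract's definition.
        return (self.total_params - self.active_params) / self.total_params

    @property
    def expert_sparsity(self) -> float:
        # Fraction of inactive experts; ignores the shared parameters.
        return (self.num_experts - self.top_k) / self.num_experts

    @property
    def forward_flops_per_token(self) -> int:
        # Assumption: roughly 2 FLOPs (one multiply, one add) per active parameter.
        return 2 * self.active_params


if __name__ == "__main__":
    small = MoEConfig(shared_params=100, params_per_expert=50, num_experts=8, top_k=2)
    large = MoEConfig(shared_params=100, params_per_expert=50, num_experts=16, top_k=2)
    for cfg in (small, large):
        print(cfg.total_params, cfg.active_params, cfg.sparsity, cfg.forward_flops_per_token)
```

In this sketch, doubling `num_experts` while holding `top_k` fixed raises the total parameter count and the sparsity but leaves the estimated FLOPs per token unchanged, which is the decoupling of parameters from compute that the abstract describes.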
