Scaling Laws for Sparsely-Connected Foundation Models
September 15, 2023
Authors: Elias Frantar, Carlos Riquelme, Neil Houlsby, Dan Alistarh, Utku Evci
cs.AI
Abstract
We explore the impact of parameter sparsity on the scaling behavior of
Transformers trained on massive datasets (i.e., "foundation models"), in both
vision and language domains. In this setting, we identify the first scaling law
describing the relationship between weight sparsity, number of non-zero
parameters, and amount of training data, which we validate empirically across
model and data scales, on ViT/JFT-4B and T5/C4. These results allow us to
characterize the "optimal sparsity", the sparsity level which yields the best
performance for a given effective model size and training budget. For a fixed
number of non-zero parameters, we identify that the optimal sparsity increases
with the amount of data used for training. We also extend our study to
different sparsity structures (such as the hardware-friendly n:m pattern) and
strategies (such as starting from a pretrained dense model). Our findings shed
light on the power and limitations of weight sparsity across various parameter
and computational settings, offering both theoretical understanding and
practical implications for leveraging sparsity towards computational efficiency
improvements.
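
As a concrete illustration of the hardware-friendly n:m pattern mentioned in the abstract, the sketch below applies magnitude-based 2:4 pruning to a weight matrix, keeping the two largest-magnitude entries in every contiguous block of four and zeroing the rest. This is a minimal NumPy sketch of the general n:m structure, not the training procedure studied in the paper; the function name `prune_n_m` and the magnitude-based selection rule are illustrative assumptions.

```python
import numpy as np

def prune_n_m(weights: np.ndarray, n: int = 2, m: int = 4) -> np.ndarray:
    """Keep only the n largest-magnitude entries in every contiguous
    block of m weights along the last axis; zero the rest.
    Assumes weights.size is divisible by m. Illustrative sketch only."""
    w = weights.reshape(-1, m)                        # group weights into blocks of m
    drop = np.argsort(np.abs(w), axis=1)[:, : m - n]  # indices of the (m - n) smallest magnitudes
    mask = np.ones_like(w, dtype=bool)
    np.put_along_axis(mask, drop, False, axis=1)      # keep only the top-n per block
    return (w * mask).reshape(weights.shape)

# Example: prune a random 4x8 matrix to the 2:4 pattern (50% sparsity).
rng = np.random.default_rng(0)
W = rng.normal(size=(4, 8))
W_24 = prune_n_m(W, n=2, m=4)
print((W_24 != 0).sum(), "non-zeros out of", W_24.size)  # -> 16 out of 32
```

Structured n:m patterns such as 2:4 map onto the sparse matrix units of some recent accelerators, which is what makes them hardware-friendly compared to unstructured sparsity at the same non-zero parameter count.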