Scaling Laws for Sparsely-Connected Foundation Models
September 15, 2023
Authors: Elias Frantar, Carlos Riquelme, Neil Houlsby, Dan Alistarh, Utku Evci
cs.AI
Abstract
We explore the impact of parameter sparsity on the scaling behavior of
Transformers trained on massive datasets (i.e., "foundation models"), in both
vision and language domains. In this setting, we identify the first scaling law
describing the relationship between weight sparsity, number of non-zero
parameters, and amount of training data, which we validate empirically across
model and data scales, on ViT/JFT-4B and T5/C4. These results allow us to
characterize the "optimal sparsity", the sparsity level which yields the best
performance for a given effective model size and training budget. For a fixed
number of non-zero parameters, we identify that the optimal sparsity increases
with the amount of data used for training. We also extend our study to
different sparsity structures (such as the hardware-friendly n:m pattern) and
strategies (such as starting from a pretrained dense model). Our findings shed
light on the power and limitations of weight sparsity across various parameter
and computational settings, offering both theoretical understanding and
practical implications for leveraging sparsity towards computational efficiency
improvements.
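
As a concrete illustration of the hardware-friendly n:m pattern mentioned in the abstract, the sketch below applies magnitude-based 2:4 pruning to a weight matrix, keeping the two largest-magnitude entries in every contiguous block of four and zeroing the rest. This is a minimal NumPy sketch of the general n:m structure, not the training procedure studied in the paper; the function name `prune_n_m` and the magnitude-based selection rule are illustrative assumptions.

```python
import numpy as np

def prune_n_m(weights: np.ndarray, n: int = 2, m: int = 4) -> np.ndarray:
    """Keep only the n largest-magnitude entries in every contiguous
    block of m weights along the last axis; zero the rest.
    Assumes weights.size is divisible by m. Illustrative sketch only."""
    w = weights.reshape(-1, m)                        # group weights into blocks of m
    drop = np.argsort(np.abs(w), axis=1)[:, : m - n]  # indices of the (m - n) smallest magnitudes
    mask = np.ones_like(w, dtype=bool)
    np.put_along_axis(mask, drop, False, axis=1)      # keep only the top-n per block
    return (w * mask).reshape(weights.shape)

# Example: prune a random 4x8 matrix to the 2:4 pattern (50% sparsity).
rng = np.random.default_rng(0)
W = rng.normal(size=(4, 8))
W_24 = prune_n_m(W, n=2, m=4)
print((W_24 != 0).sum(), "non-zeros out of", W_24.size)  # -> 16 out of 32
```

Structured n:m patterns such as 2:4 map onto the sparse matrix units of some recent accelerators, which is what makes them hardware-friendly compared to unstructured sparsity at the same non-zero parameter count.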