Scaling Laws for Sparsely-Connected Foundation Models
September 15, 2023
Authors: Elias Frantar, Carlos Riquelme, Neil Houlsby, Dan Alistarh, Utku Evci
cs.AI
Abstract
We explore the impact of parameter sparsity on the scaling behavior of Transformers trained on massive datasets (i.e., "foundation models"), in both vision and language domains. In this setting, we identify the first scaling law describing the relationship between weight sparsity, number of non-zero parameters, and amount of training data, which we validate empirically across model and data scales, on ViT/JFT-4B and T5/C4. These results allow us to characterize the "optimal sparsity": the sparsity level which yields the best performance for a given effective model size and training budget. For a fixed number of non-zero parameters, we find that the optimal sparsity increases with the amount of data used for training. We also extend our study to different sparsity structures (such as the hardware-friendly n:m pattern) and strategies (such as starting from a pretrained dense model). Our findings shed light on the power and limitations of weight sparsity across various parameter and computational settings, offering both theoretical understanding and practical implications for leveraging sparsity towards computational efficiency improvements.
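
The abstract refers to a scaling law that ties weight sparsity S, the number of non-zero parameters N, and the amount of training data D to model quality, and to an "optimal sparsity" for a given effective size and training budget. The sketch below is not the paper's fitted law or released code; the functional form, all coefficient values, and the assumption that training compute scales with the dense base size N/(1-S) times the tokens seen are hypothetical, included only to illustrate how such a law could be evaluated and how an optimum over sparsity might be located.

```python
# A minimal, self-contained sketch (not the authors' released code) of a
# sparsity-aware scaling law of the kind the abstract describes. The functional
# form, every coefficient, and the simple compute model are illustrative
# assumptions.
import numpy as np


def predicted_loss(sparsity, n_nonzero, n_tokens,
                   a_s=80.0, b_s=0.6, c_s=2.0,   # hypothetical sparsity/capacity coefficients
                   b_n=0.25,                     # hypothetical non-zero-parameter exponent
                   a_d=5e9, b_d=0.3, c=0.5):     # hypothetical data-term coefficients
    """Hypothetical L(S, N, D): a capacity term whose multiplier shrinks as
    sparsity S grows (a sparser model with N non-zeros has a larger dense base),
    plus a data-limited term in the token count D and an irreducible constant."""
    capacity = (a_s * (1.0 - sparsity) ** b_s + c_s) * (1.0 / n_nonzero) ** b_n
    data = (a_d / n_tokens) ** b_d
    return capacity + data + c


def optimal_sparsity(n_nonzero, train_flops, flops_per_weight_token=6.0):
    """Grid-search the sparsity minimizing the predicted loss at a fixed
    non-zero parameter count and training-compute budget, assuming
    (simplistically) that training cost scales with the dense base size
    N / (1 - S) times the number of tokens seen."""
    candidates = np.linspace(0.0, 0.95, 200)
    losses = []
    for s in candidates:
        dense_params = n_nonzero / (1.0 - s)
        tokens = train_flops / (flops_per_weight_token * dense_params)
        losses.append(predicted_loss(s, n_nonzero, tokens))
    return candidates[int(np.argmin(losses))]


if __name__ == "__main__":
    # With these illustrative numbers, a larger training budget (hence more
    # data seen per non-zero parameter) pushes the optimum toward higher
    # sparsity, mirroring the qualitative trend stated in the abstract.
    for budget in (1e18, 1e19, 1e20):
        s_opt = optimal_sparsity(n_nonzero=1e8, train_flops=budget)
        print(f"budget={budget:.0e} FLOPs -> optimal sparsity ~ {s_opt:.2f}")
```

Running the script with these placeholder coefficients shows the selected sparsity rising as the training budget grows, which is only meant to mirror qualitatively the trend the abstract reports, not to reproduce the paper's fitted values.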