Parameters vs FLOPs: Scaling Laws for Optimal Sparsity for Mixture-of-Experts Language Models
January 21, 2025
Authors: Samira Abnar, Harshay Shah, Dan Busbridge, Alaaeldin Mohamed Elnouby Ali, Josh Susskind, Vimal Thilak
cs.AI
Abstract
Scaling the capacity of language models has consistently proven to be a
reliable approach for improving performance and unlocking new capabilities.
Capacity can be primarily defined by two dimensions: the number of model
parameters and the compute per example. While scaling typically involves
increasing both, the precise interplay between these factors and their combined
contribution to overall capacity remains not fully understood. We explore this
relationship in the context of sparse Mixture-of-Experts (MoEs), which allow
scaling the number of parameters without proportionally increasing the FLOPs
per example. We investigate how varying the sparsity level, i.e., the fraction
of inactive parameters, impacts the model's performance during pretraining and
downstream few-shot evaluation. We find that under different constraints (e.g.,
parameter size and total training compute), there is an optimal level of
sparsity that improves both training efficiency and model performance. These
results provide a better understanding of the impact of sparsity in scaling
laws for MoEs and complement existing works in this area, offering insights for
designing more efficient architectures.
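
To make the notion of sparsity concrete, below is a minimal back-of-the-envelope sketch (not taken from the paper) for a top-k routed MoE feed-forward layer. The layer sizes (`d_model`, `d_ff`), the expert counts, and the 2-FLOPs-per-active-weight matmul approximation are illustrative assumptions; the sketch only shows how the fraction of inactive parameters decouples total parameter count from per-example compute.

```python
# Minimal sketch (illustrative assumptions, not the paper's exact accounting):
# how sparsity, total parameters, and per-token FLOPs relate for a top-k routed
# MoE feed-forward layer. Router parameters and FLOPs are ignored for simplicity.

def moe_ffn_stats(d_model: int, d_ff: int, n_experts: int, top_k: int):
    """Return total params, active params, sparsity, and approx. FLOPs/token
    for one MoE feed-forward layer."""
    params_per_expert = 2 * d_model * d_ff        # up- and down-projection weights
    total_params = n_experts * params_per_expert  # all experts are stored...
    active_params = top_k * params_per_expert     # ...but only top_k run per token
    sparsity = 1 - top_k / n_experts              # fraction of inactive parameters
    flops_per_token = 2 * active_params           # ~2 FLOPs per active weight (matmul)
    return total_params, active_params, sparsity, flops_per_token

if __name__ == "__main__":
    # Example: growing the expert count scales total parameters many times over,
    # while per-token FLOPs stay roughly at the dense (E=1, k=1) level.
    for n_experts, top_k in [(1, 1), (8, 1), (32, 2), (64, 2)]:
        total, active, sparsity, flops = moe_ffn_stats(
            d_model=1024, d_ff=4096, n_experts=n_experts, top_k=top_k
        )
        print(f"E={n_experts:3d} k={top_k}  sparsity={sparsity:5.2f}  "
              f"total={total/1e6:7.1f}M  active={active/1e6:6.1f}M  "
              f"FLOPs/token≈{flops/1e6:7.1f}M")
```

Running the example shows, for instance, that going from 1 to 64 experts with top-2 routing multiplies the layer's stored parameters by 64x while per-token compute only doubles, which is the parameters-versus-FLOPs trade-off the paper's scaling laws characterize.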