

Optimal Sparsity of Mixture-of-Experts Language Models for Reasoning Tasks

August 26, 2025
Authors: Taishi Nakamura, Satoki Ishikawa, Masaki Kawamura, Takumi Okamoto, Daisuke Nohara, Jun Suzuki, Rio Yokota
cs.AI

Abstract

Empirical scaling laws have driven the evolution of large language models (LLMs), yet their coefficients shift whenever the model architecture or data pipeline changes. Mixture-of-Experts (MoE) models, now standard in state-of-the-art systems, introduce a new sparsity dimension that current dense-model frontiers overlook. We investigate how MoE sparsity influences two distinct capability regimes: memorization and reasoning. We train families of MoE Transformers that systematically vary total parameters, active parameters, and top-k routing while holding the compute budget fixed. For every model we record pre-training loss, downstream task loss, and task accuracy, allowing us to separate the train-test generalization gap from the loss-accuracy gap. Memorization benchmarks improve monotonically with total parameters, mirroring training loss. By contrast, reasoning performance saturates and can even regress despite continued gains in both total parameters and training loss. Altering top-k alone has little effect when active parameters are constant, and classic hyperparameters such as learning rate and initialization modulate the generalization gap in the same direction as sparsity. Neither post-training reinforcement learning (GRPO) nor extra test-time compute rescues the reasoning deficit of overly sparse models. Our model checkpoints, code and logs are open-source at https://github.com/rioyokotalab/optimal-sparsity.
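The abstract sweeps total parameters, active parameters, and top-k routing at a fixed compute budget. As a rough illustration of the sparsity knob being varied, below is a minimal top-k MoE routing sketch in PyTorch. It is not the authors' implementation (see the linked repository for that); the module name, layer sizes, and expert structure are hypothetical.

```python
# Minimal sketch (not the paper's code): top-k routing in a Mixture-of-Experts layer,
# showing how the expert count and top-k set the active/total parameter ratio
# ("sparsity") studied in the paper. All names and sizes here are hypothetical.
import torch
import torch.nn as nn
import torch.nn.functional as F

class TopKMoE(nn.Module):
    def __init__(self, d_model=256, d_ff=512, n_experts=16, top_k=2):
        super().__init__()
        self.top_k = top_k
        self.router = nn.Linear(d_model, n_experts)
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model))
            for _ in range(n_experts)
        )

    def forward(self, x):                        # x: (tokens, d_model)
        logits = self.router(x)                  # (tokens, n_experts)
        weights, idx = logits.topk(self.top_k, dim=-1)
        weights = F.softmax(weights, dim=-1)     # renormalize over the chosen experts
        out = torch.zeros_like(x)
        for slot in range(self.top_k):           # only top-k experts run per token
            for e in range(len(self.experts)):
                mask = idx[:, slot] == e
                if mask.any():
                    out[mask] += weights[mask, slot, None] * self.experts[e](x[mask])
        return out

moe = TopKMoE()
tokens = torch.randn(8, 256)
print(moe(tokens).shape)                         # torch.Size([8, 256])
```

With n_experts = 16 and top_k = 2, only 1/8 of the expert parameters are active per token; the paper's experiments vary this active-to-total ratio while holding the training-compute budget fixed.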