When Does Sparsity Mitigate the Curse of Depth in LLMs

Abstract

Recent work has demonstrated the curse of depth in large language models (LLMs), where later layers contribute less to learning and representation than earlier layers. This under-utilization is linked to the accumulated growth of variance in Pre-Layer Normalization, which can push deep blocks toward near-identity behavior. In this paper, we demonstrate that sparsity, beyond enabling efficiency, acts as a regulator of variance propagation and thereby improves depth utilization. Our investigation covers two sources of sparsity: (i) implicit sparsity, which emerges from training and data conditions, including weight sparsity induced by weight decay and attention sparsity induced by long-context inputs; and (ii) explicit sparsity, which is enforced by architectural design, including key/value-sharing sparsity in Grouped-Query Attention and expert-activation sparsity in Mixture-of-Experts. Our claim is supported by controlled depth-scaling experiments and targeted layer-effectiveness interventions. Across settings, we observe a consistent relationship: sparsity improves layer utilization by reducing output variance and promoting functional differentiation. Finally, we distill our findings into a practical rule-of-thumb recipe for training depth-effective LLMs, yielding a notable 4.6% accuracy improvement on downstream tasks. Our results reveal sparsity, arising naturally from standard design choices, as a key yet previously overlooked mechanism for effective depth scaling in LLMs. Code is available at https://github.com/pUmpKin-Co/SparsityAndCoD.
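The variance argument in the abstract can be made concrete with a small toy model. The sketch below is not the paper's code or experimental setup: it assumes a residual stack of the form x <- x + W LN(x) with fresh random Gaussian weights per block, and it uses random zeroing of weights without rescaling as a crude stand-in for weight sparsity. The names run_stack, keep_prob, and layer_norm are hypothetical. It illustrates only two things: the residual-stream variance accumulates with depth under Pre-LN, so each block's update shrinks relative to the stream, and zeroing weights lowers each block's output variance and slows that accumulation. It does not reproduce the paper's claims about layer utilization or functional differentiation.

```python
import math
import random


def variance(x):
    """Population variance of a list of floats."""
    mean = sum(x) / len(x)
    return sum((v - mean) ** 2 for v in x) / len(x)


def layer_norm(x, eps=1e-6):
    """Normalize to zero mean and unit variance (no learned gain or bias)."""
    mean = sum(x) / len(x)
    std = math.sqrt(variance(x) + eps)
    return [(v - mean) / std for v in x]


def run_stack(depth, dim, keep_prob, seed=0):
    """Toy Pre-LN residual stack: x <- x + W @ LN(x).

    Assumption: each W has i.i.d. N(0, 1/dim) entries, and each entry is kept
    with probability keep_prob and zeroed otherwise, with no rescaling.
    Returns one (stream_variance, update_ratio) pair per block, where
    update_ratio = Var(block output) / Var(stream before the block).
    """
    rng = random.Random(seed)
    x = [rng.gauss(0.0, 1.0) for _ in range(dim)]
    sigma = 1.0 / math.sqrt(dim)
    stats = []
    for _ in range(depth):
        h = layer_norm(x)
        out = []
        for _ in range(dim):
            # Sample one row of W on the fly, zeroing dropped entries.
            row = [rng.gauss(0.0, sigma) if rng.random() < keep_prob else 0.0
                   for _ in range(dim)]
            out.append(sum(w * hj for w, hj in zip(row, h)))
        ratio = variance(out) / variance(x)
        x = [xi + oi for xi, oi in zip(x, out)]
        stats.append((variance(x), ratio))
    return stats


if __name__ == "__main__":
    for keep_prob in (1.0, 0.25):
        stats = run_stack(depth=16, dim=64, keep_prob=keep_prob)
        final_var, final_ratio = stats[-1]
        print(f"keep_prob={keep_prob}: final stream variance={final_var:.2f}, "
              f"last-block update ratio={final_ratio:.3f}")
```

In this toy, each dense block adds an output of roughly unit variance to the stream, so the stream variance grows roughly linearly with depth while the relative size of each update falls, which is the near-identity tendency described above. With keep_prob below 1, each block's output variance drops by about that factor, so the stream variance grows more slowly.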
