Same Architecture, Different Capacity: Optimizer-Induced Spectral Scaling Laws

Abstract

Scaling laws have made language-model performance predictable from model size, data, and compute, but they typically treat the optimizer as a fixed training detail. We show that this assumption misses a fundamental axis of representation scaling: how effectively the optimizer converts added FFN width into utilized spectral capacity. Using the eigenspectra of feed-forward network (FFN) representations, measured through soft and hard spectral ranks, we find that the same Transformer architecture realizes markedly different spectral scaling laws when trained with different optimizers. Holding architecture and width schedule fixed, AdamW exhibits weak hard-rank scaling (β=0.44) on rare-token (TAIL) representations, where learning is known to be hardest, whereas Muon achieves linear scaling (β=1.02) in the same regime, a 2.3× increase in the scaling exponent. This difference is not reducible to validation loss: under extended training, AdamW configurations can match low-rank Dion variants in perplexity while exhibiting sharply different spectral geometry, demonstrating that matched loss does not imply matched representation structure. The asymmetry between hard and soft rank further reveals that optimizers differ not only in how much capacity is realized, but also in how that capacity is distributed across eigenmodes. To disentangle optimizer effects from architectural ones, we compare against architectural interventions (e.g., attention rank and positional encoding) and find that optimizer-induced spectral shifts often exceed the architectural effects. These results establish optimization as a first-class axis of representation scaling and motivate optimizer-architecture co-design.
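The abstract does not spell out how the two spectral ranks or the exponent β are computed, so the following is only a minimal sketch under stated assumptions: the soft rank is taken to be the exponential of the Shannon entropy of the normalized eigenvalues, the hard rank is taken to be the participation ratio, and β is taken to be the slope of a log-log least-squares fit of rank against FFN width. The function names (`eigenspectrum`, `soft_rank`, `hard_rank`, `fit_scaling_exponent`) are hypothetical and chosen for illustration; they are not the authors' code.

```python
import numpy as np


def eigenspectrum(features):
    """Eigenvalues of the covariance of a (num_tokens, width) activation matrix.

    Assumption: the spectrum of interest is that of the feature covariance.
    """
    centered = features - features.mean(axis=0, keepdims=True)
    cov = centered.T @ centered / (features.shape[0] - 1)
    eigvals = np.linalg.eigvalsh(cov)  # symmetric input, real eigenvalues
    return np.clip(eigvals, 0.0, None)  # remove tiny negative round-off


def soft_rank(eigvals):
    """Assumed definition: exp of the Shannon entropy of normalized eigenvalues."""
    p = eigvals / eigvals.sum()
    p = p[p > 0]  # treat 0 * log(0) as 0
    return float(np.exp(-(p * np.log(p)).sum()))


def hard_rank(eigvals):
    """Assumed definition: participation ratio (sum lambda)^2 / sum lambda^2."""
    return float(eigvals.sum() ** 2 / (eigvals ** 2).sum())


def fit_scaling_exponent(widths, ranks):
    """Fit rank ~ c * width**beta by least squares in log-log space."""
    log_w = np.log(np.asarray(widths, dtype=float))
    log_r = np.log(np.asarray(ranks, dtype=float))
    beta, log_c = np.polyfit(log_w, log_r, deg=1)
    return float(beta), float(np.exp(log_c))
```

Under these assumed definitions, both ranks equal the number of eigenmodes for a flat spectrum and fall toward 1 as variance concentrates in a single mode. The participation ratio never exceeds the entropy-based rank, so it penalizes dominant eigenmodes more strongly. Computing either rank at several FFN widths and passing the results to `fit_scaling_exponent` gives an exponent comparable in form to the β values quoted above.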