Same Architecture, Different Capacity: Optimizer-Induced Spectral Scaling Laws

Abstract

Scaling laws have made language-model performance predictable from model size, data, and compute, but they typically treat the optimizer as a fixed training detail. We show that this assumption misses a fundamental axis of representation scaling: how effectively the optimizer converts added FFN width into utilized spectral capacity. Using the eigenspectra of feed-forward network representations, measured through soft and hard spectral ranks, we find that the same Transformer architecture realizes markedly different spectral scaling laws when trained with different optimizers. Holding architecture and width schedule fixed, AdamW exhibits weak hard-rank scaling (β=0.44) on rare-token (TAIL) representations, where learning is known to be hardest, whereas Muon achieves linear scaling (β=1.02) in the same regime, a 2.3× increase in the scaling exponent. This difference is not reducible to validation loss: with extended training, AdamW configurations can match low-rank Dion variants in perplexity while exhibiting sharply different spectral geometry, demonstrating that matched loss does not imply matched representation structure. Hard-soft rank asymmetry further reveals that optimizers differ not only in how much capacity is realized, but also in how that capacity is distributed across eigenmodes. To disentangle optimizer effects from architectural ones, we compare against architectural interventions (e.g., attention rank and positional encoding) and find that optimizer-induced spectral shifts often exceed the architectural effects. These results suggest that optimization should be treated as a first-class axis of representation scaling, motivating optimizer-architecture co-design.
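The abstract does not spell out how the spectral ranks or the exponent β are computed, so the following is only a minimal sketch under stated assumptions. It assumes the soft rank is the exponential of the Shannon entropy of the normalized covariance eigenvalues, the hard rank is the participation ratio of those eigenvalues, and β is the slope of a least-squares fit of log(rank) against log(FFN width). The function names `spectral_ranks` and `fit_scaling_exponent` are hypothetical and do not come from the paper.

```python
import numpy as np


def spectral_ranks(features):
    """Return (soft_rank, hard_rank) for an (n_samples, n_features) matrix.

    Assumed definitions (not confirmed by the abstract):
      soft rank = exp(Shannon entropy of the normalized eigenvalues)
      hard rank = participation ratio, (sum lam)^2 / sum(lam^2)
    """
    features = np.asarray(features, dtype=np.float64)
    centered = features - features.mean(axis=0, keepdims=True)
    # Covariance eigenvalues are squared singular values divided by (n - 1).
    singular_values = np.linalg.svd(centered, compute_uv=False)
    eigvals = singular_values**2 / (features.shape[0] - 1)
    p = eigvals / eigvals.sum()  # normalized eigenspectrum, sums to 1
    p_nonzero = p[p > 0]  # drop exact zeros so log() is defined
    soft_rank = float(np.exp(-(p_nonzero * np.log(p_nonzero)).sum()))
    hard_rank = float(1.0 / (p**2).sum())
    return soft_rank, hard_rank


def fit_scaling_exponent(widths, ranks):
    """Fit rank ~ c * width**beta by least squares in log-log space."""
    log_w = np.log(np.asarray(widths, dtype=np.float64))
    log_r = np.log(np.asarray(ranks, dtype=np.float64))
    # polyfit returns coefficients from highest degree down: [slope, intercept].
    beta, log_c = np.polyfit(log_w, log_r, deg=1)
    return float(beta), float(np.exp(log_c))
```

Under these assumptions, one would compute `spectral_ranks` on FFN activations collected for a token subset (for example, rare tokens) at each width, then pass the widths and the resulting hard ranks to `fit_scaling_exponent`. The reported 2.3× figure is consistent with the ratio of the two exponents, 1.02 / 0.44 ≈ 2.3.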