Same Architecture, Different Capacity: Optimizer-Induced Spectral Scaling Laws

Abstract

Scaling laws have made language-model performance predictable from model size, data, and compute, but they typically treat the optimizer as a fixed training detail. We show that this assumption misses a fundamental axis of representation scaling: how effectively the optimizer converts added FFN width into utilized spectral capacity. Using eigenspectra of feed-forward network representations, measured through soft and hard spectral ranks, we find that the same Transformer architecture realizes markedly different spectral scaling laws when trained with different optimizers. Holding architecture and width schedule fixed, AdamW exhibits weak hard-rank scaling (β=0.44) on rare-token (TAIL) representations, where learning is known to be hardest, whereas Muon achieves linear scaling (β=1.02) in the same regime, a 2.3× increase in the scaling exponent. This difference is not reducible to validation loss: under extended training, AdamW configurations can match low-rank Dion variants in perplexity while exhibiting sharply different spectral geometry, demonstrating that matched loss does not imply matched representation structure. Hard-soft rank asymmetry further reveals that optimizers differ not only in how much capacity is realized, but also in how that capacity is structured across eigenmodes. To disentangle optimizer effects from architectural ones, we compare against architectural interventions (e.g., attention rank and positional encoding) and find that optimizer-induced spectral shifts often exceed the architectural effects. These results suggest that optimization is a first-class axis of representation scaling, motivating optimizer-architecture co-design.
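
The abstract does not spell out how the soft and hard spectral ranks or the exponent β are computed, so the following is only a minimal sketch under stated assumptions. It takes the soft rank to be the exponential of the Shannon entropy of the normalized eigenvalues and the hard rank to be the participation ratio, and it estimates β as the slope of a log-log fit of rank against FFN width. The function names `spectral_ranks` and `fit_exponent`, the synthetic data, and these specific definitions are illustrative choices, not taken from the paper.

```python
import numpy as np


def spectral_ranks(features):
    """Return (soft_rank, hard_rank) for a (num_tokens, width) activation matrix.

    Assumed definitions (not confirmed by the abstract):
      soft rank = exp(Shannon entropy of the normalized eigenvalues)
      hard rank = participation ratio, 1 / sum(p_i ** 2)
    """
    features = np.asarray(features, dtype=np.float64)
    centered = features - features.mean(axis=0, keepdims=True)
    cov = centered.T @ centered / max(features.shape[0] - 1, 1)
    # eigvalsh is for symmetric matrices; clip tiny negative round-off values.
    eig = np.clip(np.linalg.eigvalsh(cov), 0.0, None)
    p = eig / eig.sum()
    nonzero = p[p > 0]
    soft_rank = float(np.exp(-np.sum(nonzero * np.log(nonzero))))
    hard_rank = float(1.0 / np.sum(p ** 2))
    return soft_rank, hard_rank


def fit_exponent(widths, ranks):
    """Estimate beta in rank ~ c * width**beta by least squares in log-log space."""
    slope, _intercept = np.polyfit(np.log(widths), np.log(ranks), 1)
    return float(slope)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    widths = [64, 128, 256]
    hard_ranks = []
    for width in widths:
        # Synthetic activations with a decaying spectrum; purely illustrative.
        scales = 1.0 / np.sqrt(np.arange(1, width + 1))
        acts = rng.standard_normal((4 * width, width)) * scales
        soft, hard = spectral_ranks(acts)
        hard_ranks.append(hard)
        print(f"width={width} soft={soft:.1f} hard={hard:.1f}")
    print(f"fitted beta = {fit_exponent(widths, hard_ranks):.2f}")
```

Under these assumed definitions, the participation ratio is dominated by the leading eigenvalues while the entropy-based rank is more sensitive to the tail of the spectrum, so a gap between the two can indicate how realized capacity is distributed across eigenmodes.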