Same Architecture, Different Capacity: Optimizer-Induced Spectral Scaling Laws

Abstract

Scaling laws have made language-model performance predictable from model size, data, and compute, but they typically treat the optimizer as a fixed training detail. We show that this assumption misses a fundamental axis of representation scaling: how effectively the optimizer converts added FFN width into utilized spectral capacity. Using the eigenspectra of feed-forward network representations, measured through soft and hard spectral ranks, we find that the same Transformer architecture realizes markedly different spectral scaling laws when trained with different optimizers. Holding architecture and width schedule fixed, AdamW exhibits weak hard-rank scaling (β=0.44) on rare-token (TAIL) representations, where learning is known to be hardest, whereas Muon achieves linear scaling (β=1.02) in the same regime, a 2.3× increase in the scaling exponent. This difference is not reducible to validation loss: under extended training, AdamW configurations can match low-rank Dion variants in perplexity while exhibiting sharply different spectral geometry, demonstrating that matched loss does not imply matched representation structure. Hard-soft rank asymmetry further reveals that optimizers differ not only in how much capacity is realized, but also in how that capacity is structured across eigenmodes. To disentangle optimizer effects from architectural ones, we compare against architectural interventions (e.g., attention rank and positional encoding) and find that optimizer-induced spectral shifts often exceed the architectural effects. These results suggest that optimization is a first-class axis of representation scaling, motivating optimizer-architecture co-design.
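The abstract does not spell out how the spectral ranks or the exponent β are computed, so the following is only a minimal sketch under stated assumptions. It assumes the soft rank is the exponential of the Shannon entropy of the normalized covariance eigenvalues, the hard rank is the participation ratio of those eigenvalues, and β is the slope of a least-squares fit of log(rank) against log(FFN width). The function names `spectral_ranks` and `fit_scaling_exponent` are hypothetical and are not taken from the paper.

```python
import numpy as np


def spectral_ranks(features):
    """Return (soft_rank, hard_rank) for an (n_tokens, d) activation matrix.

    Assumed definitions (not confirmed by the abstract):
      soft rank = exp(Shannon entropy of normalized eigenvalues)
      hard rank = participation ratio, (sum lambda)^2 / sum(lambda^2)
    """
    centered = features - features.mean(axis=0, keepdims=True)
    cov = centered.T @ centered / (features.shape[0] - 1)
    # eigvalsh handles symmetric matrices; clip tiny negative round-off values.
    eig = np.clip(np.linalg.eigvalsh(cov), 0.0, None)
    p = eig / eig.sum()
    nonzero = p[p > 0]  # skip exact zeros so log() stays finite
    soft_rank = float(np.exp(-(nonzero * np.log(nonzero)).sum()))
    hard_rank = float(eig.sum() ** 2 / (eig ** 2).sum())
    return soft_rank, hard_rank


def fit_scaling_exponent(widths, ranks):
    """Fit rank ~ c * width**beta by linear regression in log-log space."""
    beta, log_c = np.polyfit(np.log(widths), np.log(ranks), 1)
    return float(beta), float(np.exp(log_c))


if __name__ == "__main__":
    # Synthetic ranks that follow a power law with exponent 0.44 by construction.
    widths = np.array([256.0, 512.0, 1024.0, 2048.0])
    beta, c = fit_scaling_exponent(widths, 3.0 * widths ** 0.44)
    print(f"beta={beta:.2f}, c={c:.2f}")
```

Under these assumed definitions, both ranks equal the number of eigenmodes when variance is spread evenly and drop toward 1 as the spectrum concentrates in a few modes, so a larger β means added width is turned into used eigenmodes more efficiently.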