A Symmetry-Compatible Principle for Optimizer Design: Embeddings, LM Heads, SwiGLU MLPs, and MoE Routers

Abstract

A striking geometric disparity has long persisted in the practice of deep learning. While modern neural network architectures naturally exhibit rich symmetry and equivariance properties, popular optimizers such as Adam and its variants operate inherently coordinate-wise, rendering them unable to respect the equivariance structures of the parameter space. We address this disparity by introducing a symmetry-compatible principle for optimizer design: the gradient update rule should be equivariant under the symmetry group acting on the corresponding weight block. Following this principle, we first provide a unified perspective on bi-orthogonally equivariant updates for general matrix layers, as employed by stochastic spectral descent, Muon, Scion, and polar gradient methods. More importantly, by moving from orthogonal groups to permutation and shared-shift symmetries, we derive symmetry-compatible optimizers for parameter blocks whose symmetries differ from those of general matrix layers: embedding and LM head matrices, SwiGLU MLP projections, and MoE router matrices. These constructions include one-sided spectral, row-norm, hybrid row-norm/spectral, row-aware, column-aware, centered row-norm, and left-spectral updates. They yield an end-to-end layerwise optimizer stack in which each major matrix-valued parameter class is assigned an update whose equivariance matches its symmetry group. We corroborate this principle through pre-training experiments on dense and sparse MoE language models, including Qwen3-0.6B-style, Gemma 3 1B-style, OLMoE-1B-7B-style, and downsized gpt-oss architectures. Across these experiments, symmetry-compatible updates consistently lower final validation loss, and in several cases also improve training stability, relative to the corresponding AdamW updates.
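
The equivariance requirement stated above can be checked numerically on a toy gradient. The sketch below is illustrative only and is not taken from the paper: the function names (spectral_update, row_norm_update, sign_update) are hypothetical, the spectral update is assumed to be the orthogonal polar factor computed by an SVD, the row-norm update is assumed to be plain per-row normalization, and the elementwise sign is used as a simplified stand-in for coordinate-wise methods such as Adam. Under those assumptions it compares f(P G Q^T) with P f(G) Q^T for random orthogonal P and Q, and checks the row-norm map under a row permutation.

```python
import numpy as np

def spectral_update(grad):
    """Orthogonal polar factor U V^T of grad (assumed form of a spectral update)."""
    u, _, vt = np.linalg.svd(grad, full_matrices=False)
    return u @ vt

def row_norm_update(grad, eps=1e-12):
    """Rescale every row to unit Euclidean norm (assumed form of a row-norm update)."""
    norms = np.linalg.norm(grad, axis=1, keepdims=True)
    return grad / (norms + eps)

def sign_update(grad):
    """Elementwise sign: a simplified stand-in for coordinate-wise optimizers."""
    return np.sign(grad)

def random_orthogonal(n, rng):
    """Orthogonal matrix from the QR factorization of a Gaussian matrix."""
    q, _ = np.linalg.qr(rng.standard_normal((n, n)))
    return q

rng = np.random.default_rng(0)
G = rng.standard_normal((6, 4))   # toy gradient of a 6 x 4 weight block
P = random_orthogonal(6, rng)     # acts on the row (output) side
Q = random_orthogonal(4, rng)     # acts on the column (input) side
perm = rng.permutation(6)         # a relabeling of the rows

def equivariance_gap(update):
    """Largest entry of update(P G Q^T) - P update(G) Q^T."""
    return np.abs(update(P @ G @ Q.T) - P @ update(G) @ Q.T).max()

spectral_gap = equivariance_gap(spectral_update)
sign_gap = equivariance_gap(sign_update)

# Row-permutation equivariance of the row-norm map: f(G[perm]) vs f(G)[perm].
row_gap = np.abs(row_norm_update(G[perm]) - row_norm_update(G)[perm]).max()

print("spectral update, bi-orthogonal gap:", spectral_gap)
print("sign update, bi-orthogonal gap:   ", sign_gap)
print("row-norm update, permutation gap: ", row_gap)
```

In this toy setting the spectral and row-norm gaps should sit at floating-point round-off, while the sign map's gap is of order one, which is the mismatch between coordinate-wise updates and orthogonal symmetries that the abstract describes. The sign map does commute with permutations, so the contrast here concerns rotations only.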