Optimizer Design Based on a Symmetry-Compatibility Principle: Embeddings, LM Heads, SwiGLU MLPs, and MoE Routers

Abstract

A striking geometric disparity has long persisted in the practice of deep learning. While modern neural network architectures naturally exhibit rich symmetry and equivariance properties, popular optimizers such as Adam and its variants operate inherently coordinate-wise, rendering them unable to respect the equivariance structures of the parameter space. We address this disparity by introducing a symmetry-compatible principle for optimizer design: the gradient update rule should be equivariant under the symmetry group acting on the corresponding weight block. Following this principle, we first provide a unified perspective on bi-orthogonally equivariant updates for general matrix layers, as employed by stochastic spectral descent, Muon, Scion, and polar gradient methods. More importantly, by moving from orthogonal groups to permutation and shared-shift symmetries, we derive symmetry-compatible optimizers for parameter blocks whose symmetries differ from those of general matrix layers: embedding and LM head matrices, SwiGLU MLP projections, and MoE router matrices. These constructions include one-sided spectral, row-norm, hybrid row-norm/spectral, row-aware, column-aware, centered row-norm, and left-spectral updates. They yield an end-to-end layerwise optimizer stack in which each major matrix-valued parameter class is assigned an update whose equivariance matches its symmetry group. We corroborate this principle through pre-training experiments on dense and sparse MoE language models, including Qwen3-0.6B-style, Gemma 3 1B-style, OLMoE-1B-7B-style, and downsized gpt-oss architectures. Across these experiments, symmetry-compatible updates consistently improve final validation loss, and in several cases training stability, over corresponding AdamW updates.
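The equivariance requirement stated above can be checked numerically on toy gradients. The following is a minimal NumPy sketch, not the paper's implementation. It assumes three illustrative update maps whose names and exact forms are hypothetical: `polar_update` (the orthogonal polar factor of the gradient, a stand-in for the spectral updates mentioned above), `row_norm_update` (each row divided by its Euclidean norm, one plausible reading of a row-norm update), and `sign_update` (an elementwise sign map, used here only as a simplified proxy for a coordinate-wise optimizer). It tests whether each map commutes with orthogonal or permutation transformations of the gradient.

```python
import numpy as np


def polar_update(grad):
    # Orthogonal polar factor U V^T from the thin SVD of the gradient.
    # Illustrative stand-in for a spectral update (assumption, not the paper's code).
    u, _, vt = np.linalg.svd(grad, full_matrices=False)
    return u @ vt


def row_norm_update(grad, eps=1e-12):
    # Divide each row by its Euclidean norm (hypothetical form of a row-norm update).
    norms = np.linalg.norm(grad, axis=1, keepdims=True)
    return grad / (norms + eps)


def sign_update(grad):
    # Elementwise sign: a simplified proxy for a coordinate-wise update.
    return np.sign(grad)


def random_orthogonal(n, rng):
    # Orthogonal matrix obtained from the QR factorization of a Gaussian matrix.
    q, _ = np.linalg.qr(rng.standard_normal((n, n)))
    return q


rng = np.random.default_rng(0)
G = rng.standard_normal((6, 4))      # toy gradient of a 6 x 4 weight block
Q = random_orthogonal(6, rng)        # left orthogonal transformation
R = random_orthogonal(4, rng)        # right orthogonal transformation
P = np.eye(6)[rng.permutation(6)]    # row permutation matrix

# Bi-orthogonal equivariance: update(Q G R^T) should equal Q update(G) R^T.
spectral_gap = np.abs(polar_update(Q @ G @ R.T) - Q @ polar_update(G) @ R.T).max()
sign_gap = np.abs(sign_update(Q @ G @ R.T) - Q @ sign_update(G) @ R.T).max()

# Row permutation combined with a right orthogonal transformation.
row_gap = np.abs(row_norm_update(P @ G @ R.T) - P @ row_norm_update(G) @ R.T).max()

print("polar update gap:   ", spectral_gap)
print("sign update gap:    ", sign_gap)
print("row-norm update gap:", row_gap)
```

In this sketch the polar and row-norm maps commute with their respective transformations up to floating-point error, while the elementwise sign map generally does not, which is the kind of mismatch the abstract attributes to coordinate-wise optimizers.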