A Symmetry-Compatible Principle for Optimizer Design: Embeddings, LM Heads, SwiGLU MLPs, and MoE Routers

Abstract

A striking geometric disparity has long persisted in the practice of deep learning. Modern neural network architectures naturally exhibit rich symmetry and equivariance properties, yet popular optimizers such as Adam and its variants operate coordinate-wise and therefore cannot respect the equivariance structure of the parameter space. We address this disparity by introducing a symmetry-compatible principle for optimizer design: the gradient update rule should be equivariant under the symmetry group acting on the corresponding weight block. Following this principle, we first provide a unified perspective on the bi-orthogonally equivariant updates for general matrix layers employed by stochastic spectral descent, Muon, Scion, and polar gradient methods. More importantly, by moving from orthogonal groups to permutation and shared-shift symmetries, we derive symmetry-compatible optimizers for parameter blocks whose symmetries differ from those of general matrix layers: embedding and LM head matrices, SwiGLU MLP projections, and MoE router matrices. These constructions include one-sided spectral, row-norm, hybrid row-norm/spectral, row-aware, column-aware, centered row-norm, and left-spectral updates. Together they yield an end-to-end layerwise optimizer stack in which each major matrix-valued parameter class is assigned an update whose equivariance matches its symmetry group. We corroborate this principle through pre-training experiments on dense and sparse MoE language models, including Qwen3-0.6B-style, Gemma 3 1B-style, OLMoE-1B-7B-style, and downsized gpt-oss architectures. Across these experiments, symmetry-compatible updates consistently improve final validation loss, and in several cases training stability, over the corresponding AdamW updates.
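To make the equivariance requirement concrete, the following is a minimal sketch rather than the paper's implementation. It checks numerically whether an update map U satisfies U(GQ) = U(G)Q for a rotation Q, and U(PG) = P U(G) for a row permutation P. Everything named here is an assumption made for illustration: `sign_update` is a simplified stand-in for a coordinate-wise rule (it is not Adam), `row_norm_update` is a plain per-row normalization that may differ from the row-norm update constructed in the paper, and the pairing of these two symmetries with any particular layer is not taken from the abstract.

```python
import math


def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def sign_update(g):
    """Coordinate-wise sign step (hypothetical stand-in for elementwise rules)."""
    return [[math.copysign(1.0, x) if x != 0 else 0.0 for x in row] for row in g]


def row_norm_update(g, eps=1e-12):
    """Rescale every row of g to (approximately) unit Euclidean norm."""
    out = []
    for row in g:
        norm = math.sqrt(sum(x * x for x in row))
        out.append([x / (norm + eps) for x in row])
    return out


def max_abs_diff(a, b):
    """Largest entrywise absolute difference between two matrices."""
    return max(abs(x - y) for ra, rb in zip(a, b) for x, y in zip(ra, rb))


def rotation(theta):
    """2x2 rotation matrix, an element of the orthogonal group O(2)."""
    c, s = math.cos(theta), math.sin(theta)
    return [[c, -s], [s, c]]


def permute_rows(g, perm):
    """Reorder rows so that output row i is input row perm[i]."""
    return [g[i] for i in perm]


# Toy "gradient" with 3 rows and 2 columns (illustrative values only).
G = [[3.0, 1.0], [-2.0, 4.0], [0.5, -1.5]]
Q = rotation(0.7)
perm = [2, 0, 1]

# Right-rotation equivariance gap: |U(G Q) - U(G) Q|
row_rot_gap = max_abs_diff(row_norm_update(matmul(G, Q)),
                           matmul(row_norm_update(G), Q))
sign_rot_gap = max_abs_diff(sign_update(matmul(G, Q)),
                            matmul(sign_update(G), Q))

# Row-permutation equivariance gap: |U(P G) - P U(G)|
row_perm_gap = max_abs_diff(row_norm_update(permute_rows(G, perm)),
                            permute_rows(row_norm_update(G), perm))
sign_perm_gap = max_abs_diff(sign_update(permute_rows(G, perm)),
                             permute_rows(sign_update(G), perm))

print("row-norm, rotation gap:   ", row_rot_gap)
print("sign,     rotation gap:   ", sign_rot_gap)
print("row-norm, permutation gap:", row_perm_gap)
print("sign,     permutation gap:", sign_perm_gap)
```

In this toy setting both maps commute with row permutations, but only the row-normalized map also commutes with the rotation. The coordinate-wise map depends on the choice of basis, which is the kind of mismatch the principle is meant to rule out.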