A Symmetry-Compatible Principle for Optimizer Design: Embeddings, LM Heads, SwiGLU MLPs, and MoE Routers

Abstract

A striking geometric disparity has long persisted in the practice of deep learning. While modern neural network architectures naturally exhibit rich symmetry and equivariance properties, popular optimizers such as Adam and its variants operate inherently coordinate-wise, rendering them unable to respect the equivariance structures of the parameter space. We address this disparity by introducing a symmetry-compatible principle for optimizer design: the gradient update rule should be equivariant under the symmetry group acting on the corresponding weight block. Following this principle, we first provide a unified perspective on bi-orthogonally equivariant updates for general matrix layers, as employed by stochastic spectral descent, Muon, Scion, and polar gradient methods. More importantly, by moving from orthogonal groups to permutation and shared-shift symmetries, we derive symmetry-compatible optimizers for parameter blocks whose symmetries differ from those of general matrix layers: embedding and LM head matrices, SwiGLU MLP projections, and MoE router matrices. These constructions include one-sided spectral, row-norm, hybrid row-norm/spectral, row-aware, column-aware, centered row-norm, and left-spectral updates. They yield an end-to-end layerwise optimizer stack in which each major matrix-valued parameter class is assigned an update whose equivariance matches its symmetry group. We corroborate this principle through pre-training experiments on dense and sparse MoE language models, including Qwen3-0.6B-style, Gemma 3 1B-style, OLMoE-1B-7B-style, and downsized gpt-oss architectures. Across these experiments, symmetry-compatible updates consistently improve final validation loss, and in several cases training stability, over corresponding AdamW updates.
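The equivariance requirement stated above can be made concrete with a small numerical check. The sketch below is illustrative only and is not the authors' construction: it assumes a "row-norm" update means dividing each gradient row by its Euclidean norm, uses a coordinate-wise sign as a crude stand-in for coordinate-wise rules such as Adam, and assumes a symmetry group acting by row permutations and by an orthogonal change of basis on the columns. All function names are hypothetical. An update rule U is equivariant under a group element g when U(g·G) = g·U(G), so the code measures the gap between the two sides.

```python
import math

def row_normalize(G, eps=1e-12):
    """Illustrative row-norm update (assumed form): scale each row to unit Euclidean norm."""
    out = []
    for row in G:
        norm = math.sqrt(sum(x * x for x in row))
        out.append([x / (norm + eps) for x in row])
    return out

def sign_update(G):
    """Coordinate-wise sign, a crude stand-in for coordinate-wise optimizers."""
    return [[(x > 0) - (x < 0) for x in row] for row in G]

def rotate_columns(G, theta):
    """Right-multiply a two-column matrix by the rotation Q = [[c, s], [-s, c]]."""
    c, s = math.cos(theta), math.sin(theta)
    return [[a * c - b * s, a * s + b * c] for a, b in G]

def permute_rows(G, perm):
    """Reorder rows: row i of the result is row perm[i] of G."""
    return [G[i] for i in perm]

def max_abs_diff(A, B):
    """Largest entrywise absolute difference between two equally shaped matrices."""
    return max(abs(a - b) for ra, rb in zip(A, B) for a, b in zip(ra, rb))

# Toy "gradient" with three rows (e.g. three tokens) and two columns.
G = [[3.0, 4.0], [1.0, 0.0], [-2.0, 0.5]]
theta = 0.7
perm = [2, 0, 1]

# Gap between U(G Q) and U(G) Q for each rule; zero means equivariant.
rot_gap_rownorm = max_abs_diff(
    row_normalize(rotate_columns(G, theta)),
    rotate_columns(row_normalize(G), theta),
)
rot_gap_sign = max_abs_diff(
    sign_update(rotate_columns(G, theta)),
    rotate_columns(sign_update(G), theta),
)

# Gap between U(P G) and P U(G) for the row-norm rule.
perm_gap_rownorm = max_abs_diff(
    row_normalize(permute_rows(G, perm)),
    permute_rows(row_normalize(G), perm),
)

print("row-norm, rotation gap:   ", rot_gap_rownorm)
print("row-norm, permutation gap:", perm_gap_rownorm)
print("sign,     rotation gap:   ", rot_gap_sign)
```

In this toy setting the row-norm rule commutes with both group actions up to floating-point error, because rotations preserve row norms and permutations only reorder rows, whereas the coordinate-wise sign rule leaves a clearly nonzero gap under rotation. Which symmetry group actually applies to each parameter block, and which update matches it, is determined in the paper itself and not by this sketch.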