

Mano: Restriking Manifold Optimization for LLM Training

January 30, 2026
Authors: Yufei Gu, Zeke Xie
cs.AI

Abstract

While large language models (LLMs) have emerged as a significant advancement in artificial intelligence, the hardware and computational costs of training them are also a significant burden. Among state-of-the-art optimizers, AdamW relies on diagonal curvature estimates and ignores structural properties, while Muon applies global spectral normalization at the expense of curvature information. In this study, we restrike manifold optimization methods for training LLMs: although conventional manifold optimization has been largely overlooked due to its poor performance in large-scale model optimization, it may address the limitations of both optimizers. By innovatively projecting the momentum onto the tangent space of the model parameters and constraining it to a rotational Oblique manifold, we propose **Mano**, a novel, powerful, and efficient optimizer that is the first to bridge the performance gap between manifold optimization and modern optimizers. Extensive experiments on LLaMA and Qwen3 models demonstrate that Mano consistently and significantly outperforms AdamW and Muon with lower memory consumption and computational complexity, respectively, expanding the Pareto frontier of space and time efficiency.
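
To make the abstract's core mechanism concrete, the sketch below illustrates what "projecting the momentum onto the tangent space of the model parameters and constraining it to an Oblique manifold" could look like for one weight matrix. This is not the authors' implementation: the function name `mano_step_sketch`, the interpretation of the Oblique manifold as unit-norm rows, the row-wise retraction, and all hyperparameter values are assumptions made purely for illustration.

```python
import torch

def mano_step_sketch(weight, momentum, grad, lr=1e-3, beta=0.95):
    """Hypothetical sketch of one Mano-style update (not the paper's algorithm).

    Assumes the Oblique-manifold constraint means fixed-norm rows of the
    weight matrix; the exact manifold, projection, and retraction used by
    Mano may differ.
    """
    # 1. Standard momentum accumulation.
    momentum.mul_(beta).add_(grad)

    # 2. Project the momentum onto the tangent space at `weight`:
    #    for each row, remove the component parallel to that row.
    row_dot = (momentum * weight).sum(dim=1, keepdim=True)       # <m_i, w_i>
    row_norm_sq = (weight * weight).sum(dim=1, keepdim=True)     # ||w_i||^2
    tangent_update = momentum - (row_dot / row_norm_sq.clamp_min(1e-12)) * weight

    # 3. Step along the projected direction, then retract back onto the
    #    manifold by rescaling each row to its previous norm.
    new_weight = weight - lr * tangent_update
    new_weight = new_weight * (
        row_norm_sq.sqrt() / new_weight.norm(dim=1, keepdim=True).clamp_min(1e-12)
    )
    return new_weight, momentum
```

Under these assumptions, the tangent-space projection is what preserves local curvature structure that Muon's global spectral normalization discards, while the manifold constraint avoids the purely diagonal (coordinate-wise) view taken by AdamW.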