Mano: Restriking Manifold Optimization for LLM Training

Abstract

Large language models (LLMs) are a major advance in artificial intelligence, but training them carries heavy hardware and computational costs. Among state-of-the-art optimizers, AdamW relies on diagonal curvature estimates and ignores structural properties, while Muon applies global spectral normalization at the cost of losing curvature information. In this study, we revisit manifold optimization for LLM training. Conventional manifold optimization methods have been largely overlooked because they perform poorly in large-scale model optimization, yet the approach may address the limitations of both optimizers. By projecting the momentum onto the tangent space of the model parameters and constraining it to a rotational Oblique manifold, we propose **Mano**, an efficient optimizer that is the first to bridge the performance gap between manifold optimization and modern optimizers. Extensive experiments on LLaMA and Qwen3 models show that Mano consistently and significantly outperforms both AdamW and Muon while using less memory than AdamW and less computation than Muon, suggesting an expanded Pareto frontier in space and time efficiency.
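To make the two geometric ingredients named in the abstract concrete, the following is a minimal sketch of a textbook momentum step on the standard Oblique manifold (matrices whose columns have unit Euclidean norm). It is not Mano's update rule: the abstract does not specify the algorithm or what "rotational" entails, so the function names, the momentum form, and the hyperparameters `lr` and `beta` are all illustrative assumptions. Plain Python lists are used so the example runs without dependencies.

```python
import math

# Illustrative sketch only: a generic Riemannian momentum step on the
# Oblique manifold (unit-norm columns). This is NOT the Mano algorithm;
# all names and hyperparameters here are hypothetical.
# A matrix is stored as a list of columns, each column a list of floats.

def project_to_tangent(x_cols, v_cols):
    """Remove from each column of v its component along the matching column of x."""
    projected = []
    for x, v in zip(x_cols, v_cols):
        radial = sum(xi * vi for xi, vi in zip(x, v))
        projected.append([vi - radial * xi for xi, vi in zip(x, v)])
    return projected

def retract(cols):
    """Map back onto the manifold by rescaling each (nonzero) column to unit norm."""
    normalized = []
    for c in cols:
        norm = math.sqrt(sum(ci * ci for ci in c))
        normalized.append([ci / norm for ci in c])
    return normalized

def oblique_momentum_step(x_cols, grad_cols, mom_cols, lr=0.1, beta=0.9):
    """One step: update momentum, project it onto the tangent space, move, retract."""
    new_mom = [[beta * m + (1.0 - beta) * g for m, g in zip(mc, gc)]
               for mc, gc in zip(mom_cols, grad_cols)]
    tangent = project_to_tangent(x_cols, new_mom)
    stepped = [[xi - lr * ti for xi, ti in zip(xc, tc)]
               for xc, tc in zip(x_cols, tangent)]
    return retract(stepped), new_mom

# Toy 2x2 example with unit-norm columns and an arbitrary gradient.
x = [[1.0, 0.0], [0.0, 1.0]]
grad = [[0.5, 0.5], [1.0, -1.0]]
mom = [[0.0, 0.0], [0.0, 0.0]]
x_next, mom_next = oblique_momentum_step(x, grad, mom)
```

The projection discards the part of the momentum that would only change a column's length, and the retraction restores unit norm after the step, so the iterate stays on the manifold by construction.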
