

Nemotron Elastic: Towards Efficient Many-in-One Reasoning LLMs

November 20, 2025
作者: Ali Taghibakhshi, Sharath Turuvekere Sreenivas, Saurav Muralidharan, Ruisi Cai, Marcin Chochowski, Ameya Sunil Mahabaleshwarkar, Yoshi Suhara, Oluwatobi Olabiyi, Daniel Korzekwa, Mostofa Patwary, Mohammad Shoeybi, Jan Kautz, Bryan Catanzaro, Ashwath Aithal, Nima Tajbakhsh, Pavlo Molchanov
cs.AI

Abstract

Training a family of large language models targeting multiple scales and deployment objectives is prohibitively expensive, requiring a separate training run for each different size. Recent work on model compression through pruning and knowledge distillation has reduced this cost; however, the process still incurs hundreds of billions of tokens worth of training cost per compressed model. In this paper, we present Nemotron Elastic, a framework for building reasoning-oriented LLMs, including hybrid Mamba-Attention architectures, that embed multiple nested submodels within a single parent model, each optimized for different deployment configurations and budgets. Each of these submodels shares weights with the parent model and can be extracted zero-shot at deployment time without additional training or fine-tuning. We enable this functionality through an end-to-end trained router, tightly coupled to a two-stage training curriculum designed specifically for reasoning models. We additionally introduce group-aware SSM elastification that preserves Mamba's structural constraints, heterogeneous MLP elastification, normalized MSE-based layer importance for improved depth selection, and knowledge distillation enabling simultaneous multi-budget optimization. We apply Nemotron Elastic to the Nemotron Nano V2 12B model, simultaneously producing a 9B and a 6B model using only 110B training tokens; this yields over a 360x cost reduction compared to training a model family from scratch, and around 7x compared to SoTA compression techniques. Each of the nested models performs on par with or better than the SoTA in accuracy. Moreover, unlike other compression methods, the nested capability of our approach allows a many-in-one reasoning model whose deployment memory remains constant regardless of the number of models in the family.
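To make the depth-selection idea concrete, below is a minimal sketch of normalized MSE-based layer importance scoring, not the paper's implementation: a stack of residual MLP blocks stands in for the hybrid Mamba-Attention layers, a random calibration batch stands in for real activations, and normalizing the per-layer ablation MSE by the full output's mean squared value is one plausible reading of "normalized MSE". The `KEEP` budget and all names here are illustrative assumptions.

```python
# Hedged sketch: rank decoder layers by the normalized MSE incurred when each
# layer is skipped, then keep the highest-impact layers for a shallower submodel.
import torch
import torch.nn as nn

torch.manual_seed(0)

HIDDEN, N_LAYERS, KEEP = 64, 8, 6  # toy sizes; real models are far larger

# Stand-in for a stack of transformer/Mamba blocks: simple residual MLP layers.
layers = nn.ModuleList([
    nn.Sequential(nn.Linear(HIDDEN, HIDDEN), nn.GELU(), nn.Linear(HIDDEN, HIDDEN))
    for _ in range(N_LAYERS)
])

def run_stack(x: torch.Tensor, skip: int | None = None) -> torch.Tensor:
    """Run the residual stack, optionally skipping one layer index."""
    for i, layer in enumerate(layers):
        if i == skip:
            continue
        x = x + layer(x)
    return x

@torch.no_grad()
def layer_importance(x: torch.Tensor) -> list[float]:
    """Normalized MSE between the full output and the output with layer i ablated."""
    full = run_stack(x)
    scores = []
    for i in range(N_LAYERS):
        ablated = run_stack(x, skip=i)
        mse = (full - ablated).pow(2).mean()
        scores.append((mse / full.pow(2).mean()).item())  # normalize by output energy
    return scores

calib = torch.randn(32, HIDDEN)  # stand-in for a calibration batch of activations
scores = layer_importance(calib)
keep = sorted(range(N_LAYERS), key=lambda i: scores[i], reverse=True)[:KEEP]
print("layer importance:", [f"{s:.3f}" for s in scores])
print("layers kept for the shallower submodel:", sorted(keep))
```

In the setting the abstract describes, an importance signal of this kind would only inform depth selection; which layers and widths each nested submodel actually uses is governed by the end-to-end trained router, and the chosen submodels share weights with the parent so they can be extracted zero-shot at deployment.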