Nemotron Elastic: Towards Efficient Many-in-One Reasoning LLMs
November 20, 2025
Authors: Ali Taghibakhshi, Sharath Turuvekere Sreenivas, Saurav Muralidharan, Ruisi Cai, Marcin Chochowski, Ameya Sunil Mahabaleshwarkar, Yoshi Suhara, Oluwatobi Olabiyi, Daniel Korzekwa, Mostofa Patwary, Mohammad Shoeybi, Jan Kautz, Bryan Catanzaro, Ashwath Aithal, Nima Tajbakhsh, Pavlo Molchanov
cs.AI
Abstract
Training a family of large language models targeting multiple scales and deployment objectives is prohibitively expensive, requiring a separate training run for each size. Recent work on model compression through pruning and knowledge distillation has reduced this cost; however, the process still incurs hundreds of billions of tokens' worth of training cost per compressed model. In this paper, we present Nemotron Elastic, a framework for building reasoning-oriented LLMs, including hybrid Mamba-Attention architectures, that embed multiple nested submodels within a single parent model, each optimized for a different deployment configuration and budget. These submodels share weights with the parent model and can be extracted zero-shot at deployment time without additional training or fine-tuning. We enable this functionality through an end-to-end trained router, tightly coupled to a two-stage training curriculum designed specifically for reasoning models. We additionally introduce group-aware SSM elastification that preserves Mamba's structural constraints, heterogeneous MLP elastification, normalized MSE-based layer importance for improved depth selection, and knowledge distillation that enables simultaneous multi-budget optimization. We apply Nemotron Elastic to the Nemotron Nano V2 12B model, simultaneously producing a 9B and a 6B model using only 110B training tokens; this yields over a 360x cost reduction compared to training a model family from scratch, and around 7x compared to state-of-the-art (SoTA) compression techniques. Each of the nested models performs on par with or better than SoTA models in accuracy. Moreover, unlike other compression methods, the nested structure of our approach yields a many-in-one reasoning model whose deployment memory footprint remains constant regardless of the number of models in the family.
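The abstract does not spell out how the normalized MSE-based layer importance is computed. The sketch below shows one plausible reading, not the paper's implementation: each layer is scored by the MSE between its input and output hidden states on a small calibration batch, normalized by the input's mean squared magnitude so that scores are comparable across layers with different activation scales; the lowest-scoring layers are then dropped when selecting a shallower submodel. Function names and the normalization choice are illustrative assumptions.

```python
# Hedged sketch of a normalized-MSE layer-importance score for depth selection.
# Assumption (not stated in the abstract): a layer that barely changes its input
# is a candidate for removal in shallower nested submodels.
import torch

def layer_importance_nmse(hidden_in: torch.Tensor, hidden_out: torch.Tensor) -> float:
    """Normalized MSE between a layer's input and output activations.

    hidden_in / hidden_out: [batch, seq_len, hidden_dim] activations captured
    on a calibration set. Lower score = layer changes its input less.
    """
    mse = torch.mean((hidden_out - hidden_in) ** 2)
    norm = torch.mean(hidden_in ** 2) + 1e-8  # guard against division by zero
    return (mse / norm).item()

def select_depth(scores: list[float], k: int) -> list[int]:
    """Keep the k most important layers for a given depth budget, in original order."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return sorted(order[:k])
```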
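Similarly, the simultaneous multi-budget knowledge distillation described in the abstract can be pictured as summing a distillation loss over several nested budgets at each training step, with the full-size parent acting as teacher. The `budget` keyword on the forward pass and the loss averaging below are hypothetical interfaces used only to illustrate optimizing all nested submodels against a single teacher in one run; they are not the paper's API.

```python
# Hedged sketch of simultaneous multi-budget knowledge distillation.
import torch
import torch.nn.functional as F

def multi_budget_kd_loss(parent, batch, budgets, temperature: float = 1.0):
    """Average KL-based distillation loss over the nested budgets.

    parent: weight-shared elastic model; budgets: e.g. the 6B and 9B configurations.
    """
    with torch.no_grad():
        teacher_logits = parent(batch["input_ids"]).logits  # full-size parent as teacher

    loss = 0.0
    for budget in budgets:
        # Hypothetical interface: the router restricts the shared weights to `budget`.
        student_logits = parent(batch["input_ids"], budget=budget).logits
        loss = loss + F.kl_div(
            F.log_softmax(student_logits / temperature, dim=-1),
            F.softmax(teacher_logits / temperature, dim=-1),
            reduction="batchmean",
        ) * temperature ** 2
    return loss / len(budgets)
```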