Nemotron-Flash: Towards Latency-Optimal Hybrid Small Language Models
November 24, 2025
Authors: Yonggan Fu, Xin Dong, Shizhe Diao, Matthijs Van keirsbilck, Hanrong Ye, Wonmin Byeon, Yashaswi Karnati, Lucas Liebenwein, Hannah Zhang, Nikolaus Binder, Maksim Khadkevich, Alexander Keller, Jan Kautz, Yingyan Celine Lin, Pavlo Molchanov
cs.AI
Abstract
Efficient deployment of small language models (SLMs) is essential for numerous real-world applications with stringent latency constraints. While previous work on SLM design has primarily focused on reducing the number of parameters to achieve parameter-optimal SLMs, parameter efficiency does not necessarily translate into proportional real-device speed-ups. This work aims to identify the key determinants of SLMs' real-device latency and offer generalizable principles and methodologies for SLM design and training when real-device latency is the primary consideration. Specifically, we identify two central architectural factors: depth-width ratios and operator choices. The former is crucial for small-batch-size latency, while the latter affects both latency and large-batch-size throughput. In light of this, we first study latency-optimal depth-width ratios, with the key finding that although deep-thin models generally achieve better accuracy under the same parameter budget, they may not lie on the accuracy-latency trade-off frontier. Next, we explore emerging efficient attention alternatives to evaluate their potential as candidate building operators. Using the identified promising operators, we construct an evolutionary search framework to automatically discover latency-optimal combinations of these operators within hybrid SLMs, thereby advancing the accuracy-latency frontier. In addition to architectural improvements, we further enhance SLM training using a weight normalization technique that enables more effective weight updates and improves final convergence. Combining these methods, we introduce a new family of hybrid SLMs, called Nemotron-Flash, which significantly advances the accuracy-efficiency frontier of state-of-the-art SLMs, e.g., achieving over +5.5% average accuracy, 1.3x/1.9x lower latency, and 18.7x/45.6x higher throughput compared to Qwen3-1.7B/0.6B, respectively.
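
To make the search idea concrete, below is a toy sketch of a latency-aware evolutionary search over a hybrid stack of layer operators, in the spirit of the framework described above. The operator set, the proxy_latency and proxy_accuracy cost models, and all hyperparameters are illustrative placeholders invented for this sketch; they are not Nemotron-Flash's actual search space, evaluators, or budgets.

```python
# Toy sketch: evolutionary search over hybrid layer layouts under a latency budget.
# All operators, cost models, and hyperparameters below are illustrative assumptions.
import random

OPERATORS = ["attn", "swa", "ssm"]   # assumed candidate operators (full attention, sliding-window, SSM-style)
DEPTH = 12                           # fixed number of layers in this toy example


def proxy_latency(arch):
    # Stand-in cost model: full attention assumed slowest, SSM-style fastest.
    cost = {"attn": 3.0, "swa": 1.5, "ssm": 1.0}
    return sum(cost[op] for op in arch)


def proxy_accuracy(arch):
    # Stand-in quality score: rewards keeping some attention layers in the stack.
    return sum(1.0 if op == "attn" else 0.6 for op in arch) + random.random()


def fitness(arch, latency_budget=24.0):
    # Maximize the accuracy proxy, softly penalizing layouts over the latency budget.
    penalty = max(0.0, proxy_latency(arch) - latency_budget)
    return proxy_accuracy(arch) - 2.0 * penalty


def mutate(arch, p=0.2):
    # Randomly swap each layer's operator with probability p.
    return [random.choice(OPERATORS) if random.random() < p else op for op in arch]


def evolve(pop_size=32, generations=20):
    population = [[random.choice(OPERATORS) for _ in range(DEPTH)] for _ in range(pop_size)]
    for _ in range(generations):
        population.sort(key=fitness, reverse=True)
        parents = population[: pop_size // 4]  # keep the fittest quarter as parents
        children = [mutate(random.choice(parents)) for _ in range(pop_size - len(parents))]
        population = parents + children
    return max(population, key=fitness)


if __name__ == "__main__":
    best = evolve()
    print("best hybrid layout:", best)
    print("proxy latency:", proxy_latency(best))
```

In the actual method, the proxy evaluators would be replaced by trained-model accuracy estimates and measured on-device latency, which is what allows the search to land on combinations of operators that sit on the accuracy-latency frontier rather than the parameter-count frontier.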