NVIDIA Nemotron Nano 2: An Accurate and Efficient Hybrid Mamba-Transformer Reasoning Model
August 20, 2025
Authors: NVIDIA, Aarti Basant, Abhijit Khairnar, Abhijit Paithankar, Abhinav Khattar, Adi Renduchintala, Adithya Renduchintala, Aditya Malte, Akhiad Bercovich, Akshay Hazare, Alejandra Rico, Aleksander Ficek, Alex Kondratenko, Alex Shaposhnikov, Ali Taghibakhshi, Amelia Barton, Ameya Sunil Mahabaleshwarkar, Amy Shen, Andrew Tao, Ann Guan, Anna Shors, Anubhav Mandarwal, Arham Mehta, Arun Venkatesan, Ashton Sharabiani, Ashwath Aithal, Ashwin Poojary, Ayush Dattagupta, Balaram Buddharaju, Banghua Zhu, Barnaby Simkin, Bilal Kartal, Bita Darvish Rouhani, Bobby Chen, Boris Ginsburg, Brandon Norick, Brian Yu, Bryan Catanzaro, Charles Wang, Charlie Truong, Chetan Mungekar, Chintan Patel, Chris Alexiuk, Christian Munley, Christopher Parisien, Dan Su, Daniel Afrimi, Daniel Korzekwa, Daniel Rohrer, Daria Gitman, David Mosallanezhad, Deepak Narayanan, Dima Rekesh, Dina Yared, Dmytro Pykhtar, Dong Ahn, Duncan Riach, Eileen Long, Elliott Ning, Eric Chung, Erick Galinkin, Evelina Bakhturina, Gargi Prasad, Gerald Shen, Haim Elisha, Harsh Sharma, Hayley Ross, Helen Ngo, Herman Sahota, Hexin Wang, Hoo Chang Shin, Hua Huang, Iain Cunningham, Igor Gitman, Ivan Moshkov, Jaehun Jung, Jan Kautz, Jane Polak Scowcroft, Jared Casper, Jimmy Zhang, Jinze Xue, Jocelyn Huang, Joey Conway, John Kamalu, Jonathan Cohen, Joseph Jennings, Julien Veron Vialard, Junkeun Yi, Jupinder Parmar, Kari Briski, Katherine Cheung, Katherine Luna, Keith Wyss, Keshav Santhanam, Kezhi Kong, Krzysztof Pawelec, Kumar Anik, Kunlun Li, Kushan Ahmadian, Lawrence McAfee, Laya Sleiman, Leon Derczynski, Luis Vega, Maer Rodrigues de Melo, Makesh Narsimhan Sreedhar, Marcin Chochowski, Mark Cai, Markus Kliegl, Marta Stepniewska-Dziubinska, Matvei Novikov, Mehrzad Samadi, Meredith Price, Meriem Boubdir, Michael Boone, Michael Evans, Michal Bien, Michal Zawalski, Miguel Martinez, Mike Chrzanowski, Mohammad Shoeybi, Mostofa Patwary, Namit Dhameja, Nave Assaf, Negar Habibi, Nidhi Bhatia, Nikki Pope, Nima Tajbakhsh, Nirmal Kumar Juluru, Oleg Rybakov, Oleksii Hrinchuk, Oleksii Kuchaiev, Oluwatobi Olabiyi, Pablo Ribalta, Padmavathy Subramanian, Parth Chadha, Pavlo Molchanov, Peter Dykas, Peter Jin, Piotr Bialecki, Piotr Januszewski, Pradeep Thalasta, Prashant Gaikwad, Prasoon Varshney, Pritam Gundecha, Przemek Tredak, Rabeeh Karimi Mahabadi, Rajen Patel, Ran El-Yaniv, Ranjit Rajan, Ria Cheruvu, Rima Shahbazyan, Ritika Borkar, Ritu Gala, Roger Waleffe, Ruoxi Zhang, Russell J. Hewett, Ryan Prenger, Sahil Jain, Samuel Kriman, Sanjeev Satheesh, Saori Kaji, Sarah Yurick, Saurav Muralidharan, Sean Narenthiran, Seonmyeong Bak, Sepehr Sameni, Seungju Han, Shanmugam Ramasamy, Shaona Ghosh, Sharath Turuvekere Sreenivas, Shelby Thomas, Shizhe Diao, Shreya Gopal, Shrimai Prabhumoye, Shubham Toshniwal, Shuoyang Ding, Siddharth Singh, Siddhartha Jain, Somshubra Majumdar, Stefania Alborghetti, Syeda Nahida Akter, Terry Kong, Tim Moon, Tomasz Hliwiak, Tomer Asida, Tony Wang, Twinkle Vashishth, Tyler Poon, Udi Karpas, Vahid Noroozi, Venkat Srinivasan, Vijay Korthikanti, Vikram Fugro, Vineeth Kalluru, Vitaly Kurin, Vitaly Lavrukhin, Wasi Uddin Ahmad, Wei Du, Wonmin Byeon, Ximing Lu, Xin Dong, Yashaswi Karnati, Yejin Choi, Yian Zhang, Ying Lin, Yonggan Fu, Yoshi Suhara, Zhen Dong, Zhiyu Li, Zhongbo Zhu, Zijia Chen
cs.AI
Abstract
We introduce Nemotron-Nano-9B-v2, a hybrid Mamba-Transformer language model
designed to increase throughput for reasoning workloads while achieving
state-of-the-art accuracy compared to similarly-sized models.
Nemotron-Nano-9B-v2 builds on the Nemotron-H architecture, in which the
majority of the self-attention layers in the common Transformer architecture
are replaced with Mamba-2 layers, to achieve improved inference speed when
generating the long thinking traces needed for reasoning. We create
Nemotron-Nano-9B-v2 by first pre-training a 12-billion-parameter model
(Nemotron-Nano-12B-v2-Base) on 20 trillion tokens using an FP8 training recipe.
After aligning Nemotron-Nano-12B-v2-Base, we employ the Minitron strategy to
compress and distill the model with the goal of enabling inference on up to
128k tokens on a single NVIDIA A10G GPU (22GiB of memory, bfloat16 precision).
Compared to existing similarly-sized models (e.g., Qwen3-8B), we show that
Nemotron-Nano-9B-v2 achieves on-par or better accuracy on reasoning benchmarks
while achieving up to 6x higher inference throughput in reasoning settings like
8k input and 16k output tokens. We are releasing Nemotron-Nano-9B-v2,
Nemotron-Nano-12B-v2-Base, and Nemotron-Nano-9B-v2-Base checkpoints along with
the majority of our pre- and post-training datasets on Hugging Face.
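Since the checkpoints are released on Hugging Face, a minimal sketch of loading and querying Nemotron-Nano-9B-v2 with the Hugging Face transformers library is shown below. The repository id, the need for trust_remote_code, and the generation settings are illustrative assumptions and are not specified in the abstract.

```python
# Minimal sketch (assumed usage, not from the paper): load the released
# Nemotron-Nano-9B-v2 checkpoint from Hugging Face and run a short generation.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "nvidia/NVIDIA-Nemotron-Nano-9B-v2"  # assumed Hugging Face repo id

tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,  # matches the bfloat16 precision cited for A10G inference
    device_map="auto",
    trust_remote_code=True,      # the hybrid Mamba-2/attention stack may ship custom modeling code
)

# Chat-style prompt; apply_chat_template formats it with the model's own template.
messages = [{"role": "user", "content": "Summarize why replacing most attention layers "
                                        "with Mamba-2 layers speeds up long generations."}]
input_ids = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

output_ids = model.generate(input_ids, max_new_tokens=512)
print(tokenizer.decode(output_ids[0][input_ids.shape[-1]:], skip_special_tokens=True))
```

For throughput measurements in reasoning settings such as the 8k-input/16k-output configuration mentioned above, a batched serving stack would typically be used instead of a single generate call; the snippet is only meant to show basic access to the released checkpoint.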