NVIDIA Nemotron Nano 2: An Accurate and Efficient Hybrid Mamba-Transformer Reasoning Model

Abstract

We introduce Nemotron-Nano-9B-v2, a hybrid Mamba-Transformer language model designed to increase throughput for reasoning workloads while achieving state-of-the-art accuracy compared to similarly sized models. Nemotron-Nano-9B-v2 builds on the Nemotron-H architecture, in which the majority of the self-attention layers in the common Transformer architecture are replaced with Mamba-2 layers, to achieve improved inference speed when generating the long thinking traces needed for reasoning. We create Nemotron-Nano-9B-v2 by first pre-training a 12-billion-parameter model (Nemotron-Nano-12B-v2-Base) on 20 trillion tokens using an FP8 training recipe. After aligning Nemotron-Nano-12B-v2-Base, we employ the Minitron strategy to compress and distill the model with the goal of enabling inference on up to 128k tokens on a single NVIDIA A10G GPU (22 GiB of memory, bfloat16 precision). Compared to existing similarly sized models (e.g., Qwen3-8B), we show that Nemotron-Nano-9B-v2 achieves on-par or better accuracy on reasoning benchmarks while achieving up to 6x higher inference throughput in reasoning settings such as 8k input and 16k output tokens. We are releasing the Nemotron-Nano-9B-v2, Nemotron-Nano-12B-v2-Base, and Nemotron-Nano-9B-v2-Base checkpoints, along with the majority of our pre- and post-training datasets, on Hugging Face.
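
As a rough, back-of-the-envelope illustration of the memory target above, the following sketch estimates how much of a 22 GiB budget the bfloat16 weights alone would occupy, and how the key-value cache of the attention layers grows with context length. It is only a sketch under stated assumptions: the parameter counts are the nominal 9 and 12 billion, "128k" is taken as 128 × 1024 tokens, and the layer counts, KV-head count, and head dimension passed to `kv_cache_bytes` are hypothetical round numbers chosen for illustration, not the published model configuration. Activations, Mamba-2 state, and framework overhead are ignored.

```python
GIB = 1024 ** 3  # bytes in one gibibyte


def weight_bytes(num_params: int, bytes_per_param: int = 2) -> int:
    """Memory for the weights alone; bfloat16 uses 2 bytes per parameter."""
    return num_params * bytes_per_param


def kv_cache_bytes(num_attention_layers: int, num_kv_heads: int,
                   head_dim: int, seq_len: int,
                   bytes_per_value: int = 2) -> int:
    """KV-cache size for one sequence; the leading 2 covers keys and values."""
    return (2 * num_attention_layers * num_kv_heads
            * head_dim * seq_len * bytes_per_value)


budget = 22 * GIB                # memory budget quoted in the abstract
seq_len = 128 * 1024             # assumption: "128k tokens" = 131,072 tokens

# Nominal parameter counts (the exact counts may differ).
weights_9b = weight_bytes(9_000_000_000)
weights_12b = weight_bytes(12_000_000_000)

# Hypothetical configurations, for illustration only:
# a hybrid with few attention layers vs. an all-attention stack.
kv_hybrid = kv_cache_bytes(num_attention_layers=4, num_kv_heads=8,
                           head_dim=128, seq_len=seq_len)
kv_dense = kv_cache_bytes(num_attention_layers=32, num_kv_heads=8,
                          head_dim=128, seq_len=seq_len)

print(f"9B weights (bf16):   {weights_9b / GIB:.2f} GiB")
print(f"12B weights (bf16):  {weights_12b / GIB:.2f} GiB")
print(f"Headroom after 9B:   {(budget - weights_9b) / GIB:.2f} GiB")
print(f"KV cache, 4 layers:  {kv_hybrid / GIB:.2f} GiB")
print(f"KV cache, 32 layers: {kv_dense / GIB:.2f} GiB")
```

Under these assumptions, the nominal 12B weights alone would exceed the 22 GiB budget in bfloat16, while the nominal 9B weights leave a few GiB of headroom. Because the KV cache scales linearly with the number of attention layers, a model with only a handful of them keeps the long-context cache within that headroom, whereas the hypothetical all-attention stack would not.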