Nemotron-Labs-Diffusion-Image: Advances in Masked Discrete Diffusion for High-Resolution Image Synthesis

Abstract

We propose Nemotron-Labs-Diffusion-Image, a state-of-the-art masked discrete diffusion model (MDM) for high-resolution text-to-image synthesis. Compared with prior work on masked image generation, Nemotron-Labs-Diffusion-Image addresses two key challenges. First, unlike continuous diffusion models, which progressively refine latent representations across the entire image, standard MDMs lack a self-correcting capability because discrete tokens cannot be modified once they are unmasked. Second, although increasing the vocabulary size of a discrete image tokenizer improves reconstruction fidelity, it makes generative modeling harder to optimize because the per-token training signal becomes increasingly sparse. To address the first challenge, Nemotron-Labs-Diffusion-Image incorporates a token-editing mechanism that lets the model dynamically revise already-unmasked tokens during inference, much as a sculptor iteratively refines a work. To tackle the second challenge, we propose a Grouped Cross-Entropy (GCE) objective that assigns positive learning signals to tokens neighboring the ground truth in embedding space, thereby alleviating signal sparsity. To further improve training efficiency, we implement a custom fused operator for GCE that significantly reduces VRAM usage in large-vocabulary settings. Experimental results demonstrate that these innovations substantially improve both the training efficiency and the image fidelity of masked discrete image generators, achieving scores of 0.90 on GenEval, 86.9 on DPG, and 10.76 on HPSv3.
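
The abstract does not give the exact form of the GCE objective, so the following is only an illustrative sketch of one plausible reading, not the implementation used in this work. It assumes that the "group" for a ground-truth token is its k nearest neighbors in the tokenizer's codebook embedding space (including the token itself), and that the loss is the negative log of the total predicted probability on that group. The function names, the choice of squared Euclidean distance, and the toy codebook are all hypothetical, and the fused operator mentioned above is not modeled.

```python
import math


def nearest_neighbors(codebook, index, k):
    """Indices of the k codebook entries closest to codebook[index].

    Assumption: squared Euclidean distance; the target itself is included.
    """
    target = codebook[index]

    def sq_dist(j):
        return sum((a - b) ** 2 for a, b in zip(codebook[j], target))

    return sorted(range(len(codebook)), key=sq_dist)[:k]


def log_softmax(logits):
    """Numerically stable log-softmax over a list of floats."""
    m = max(logits)
    lse = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - lse for x in logits]


def cross_entropy(logits, target):
    """Standard cross-entropy: only the ground-truth token is rewarded."""
    return -log_softmax(logits)[target]


def grouped_cross_entropy(logits, group):
    """Hypothetical grouped variant: -log of the probability mass on the group."""
    logp = log_softmax(logits)
    m = max(logp[j] for j in group)
    return -(m + math.log(sum(math.exp(logp[j] - m) for j in group)))


# Toy codebook (hypothetical): tokens 0 and 1 are near each other, as are 2 and 3.
codebook = [[0.0, 0.0], [0.1, 0.0], [5.0, 5.0], [5.1, 5.0]]
logits = [1.0, 2.0, 0.5, 0.0]  # model prefers token 1, a close neighbor of token 0
target = 0

group = nearest_neighbors(codebook, target, k=2)
ce = cross_entropy(logits, target)
gce = grouped_cross_entropy(logits, group)
```

Under these assumptions, a group of size one reduces to the standard cross-entropy, and enlarging the group can only lower the loss, because probability placed on near-identical tokens is no longer treated as an error. This is the sense in which such an objective would densify the per-token signal for large vocabularies.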