DiffusionBench: A Comprehensive Evaluation of Diffusion Transformers

Abstract

Research on diffusion transformers (DiTs) for image generation has converged on a single evaluation setup: class-conditional generation on ImageNet. Although new methods keep improving FID and related metrics, it is increasingly unclear whether these gains reflect real progress in generative modeling. The natural alternative, text-to-image (T2I) generation, is perceived as too costly or inconvenient to train and evaluate, and is often skipped. We argue that this perception no longer holds. We introduce NanoGen, a unified DiT training and evaluation framework. NanoGen matches state-of-the-art DiT baselines on ImageNet and, with 12 lines of configuration change, also trains competitive T2I models. It currently supports RAE, VAE, pixel-space, and MeanFlow diffusion methods under both the ImageNet and T2I setups. Under NanoGen, T2I training requires compute comparable to ImageNet training. After training 21 latent diffusion models with NanoGen, we observe that method rankings show no strong correlation between ImageNet and T2I generation: the Pearson correlation ranges from -0.377 to -0.580 across three metrics. This suggests that a method that improves class-conditional ImageNet FID may show no corresponding improvement on T2I, which clearly indicates the need to evaluate DiTs on both tasks. To this end, we combine ImageNet and T2I results into DiffusionBench, a holistic benchmark for DiT research. We recommend reporting DiffusionBench in place of ImageNet alone: methods that improve DiffusionBench are more likely to reflect broader progress.
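
To make the cross-task comparison concrete, the following is a minimal sketch of how a Pearson correlation between per-method scores on two tasks could be computed. It is not taken from NanoGen or DiffusionBench: the function name `pearson`, the variable names, and the score values are all hypothetical placeholders, and the abstract does not state whether the correlation is computed on raw metric values or on ranks.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of products of deviations (unnormalized covariance).
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    # Sums of squared deviations (unnormalized variances).
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(var_x * var_y)


# Hypothetical placeholder scores, one entry per method.
# These are NOT results from the paper; they only illustrate the call.
imagenet_scores = [1.0, 2.0, 3.0, 4.0, 5.0]
t2i_scores = [2.0, 1.0, 4.0, 3.0, 5.0]

r = pearson(imagenet_scores, t2i_scores)
print(f"Pearson r between ImageNet and T2I scores: {r:.3f}")
```

With real data, each list would hold one metric value per trained model, in the same model order for both tasks. A coefficient near +1 would mean ImageNet results predict T2I results well; values near zero or below, such as those reported above, would mean they do not.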