DiffusionBench: On Holistic Evaluation of Diffusion Transformers

June 23, 2026

Authors: Xingjian Leng, Jaskirat Singh, Zhanhao Liang, Ethan Smith, Martin Bell, Aninda Saha, Yuhui Yuan, Liang Zheng

cs.AI

Abstract

Diffusion transformer (DiT) research on image generation has converged on a single evaluation setup: class-conditional generation on ImageNet. While methods continue to improve FID and related metrics, it is increasingly unclear whether these gains reflect real progress in generative modeling. The natural alternative, text-to-image (T2I) generation, is perceived as too costly or inconvenient to train and evaluate, and is often skipped. We argue that this perception no longer holds. We introduce NanoGen, a unified DiT training and evaluation framework. NanoGen matches state-of-the-art DiT baselines on ImageNet and, with 12 lines of configuration change, also trains competitive text-to-image models. It currently supports RAE, VAE, pixel-space, and MeanFlow diffusion methods under both the ImageNet and T2I setups. Under NanoGen, training a T2I model requires compute comparable to training on ImageNet. After training 21 latent diffusion models with NanoGen, we observe that method ranking shows no strong correlation between ImageNet and T2I generation: the Pearson correlation ranges from -0.377 to -0.580 across three metrics. This suggests that a method that improves class-conditional ImageNet FID may show no corresponding improvement on T2I, which indicates that DiTs need to be evaluated on both tasks. To this end, we summarize the ImageNet and text-to-image results into DiffusionBench, a holistic benchmark for DiT research. We recommend reporting DiffusionBench in place of ImageNet alone: methods that improve on DiffusionBench are more likely to reflect broader progress.
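
To make the correlation claim concrete, the following is a minimal sketch of how a Pearson correlation between per-method ImageNet and T2I scores could be computed. It is not the authors' evaluation code. The function name `pearson` and the toy score lists are hypothetical placeholders, not results from the paper, and the abstract does not say whether the authors correlate raw metric values or ranks; this sketch uses raw values.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of products of deviations (unnormalized covariance).
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    # Sums of squared deviations (unnormalized variances).
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(var_x * var_y)


# Hypothetical placeholder scores for four methods (NOT values from the paper).
# Each index is one method: its ImageNet metric and its T2I metric.
toy_imagenet_scores = [1.0, 2.0, 3.0, 4.0]
toy_t2i_scores = [2.0, 1.0, 4.0, 3.0]

r = pearson(toy_imagenet_scores, toy_t2i_scores)
print(f"Pearson r = {r:.3f}")
```

In practice one would pass one list per benchmark, with one entry per trained model (21 in the setting described above), and repeat the computation for each metric pair. When the two metrics have opposite orientation (for example, lower-is-better FID against a higher-is-better score), the sign of one list would need to be flipped before interpreting the result.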