DiffusionBench: A Holistic Evaluation of Diffusion Transformers

Abstract

Diffusion transformer (DiT) research on image generation has converged on a single evaluation setup: class-conditional generation on ImageNet. Methods keep improving FID and related metrics, but it is increasingly unclear whether those gains reflect real progress in generative modeling. The natural alternative, text-to-image (T2I) generation, is perceived as too costly or inconvenient to train and evaluate, and is often skipped. We argue that this perception no longer holds. We introduce NanoGen, a unified DiT training and evaluation framework. NanoGen matches state-of-the-art DiT baselines on ImageNet and, with only 12 lines of configuration changes, also trains competitive text-to-image models. It currently supports RAE, VAE, pixel-space, and MeanFlow diffusion methods under both the ImageNet and T2I setups. Under NanoGen, T2I training requires compute comparable to ImageNet training. After training 21 latent diffusion models with NanoGen, we observe that method rankings show no strong correlation between ImageNet and T2I generation: across three metrics, the Pearson correlation ranges from -0.377 to -0.580. This suggests that a method that improves class-conditional ImageNet FID may show no corresponding improvement on T2I, which makes a clear case for evaluating DiTs on both tasks. To this end, we summarize the ImageNet and text-to-image results into DiffusionBench, a holistic benchmark for DiT research. We recommend reporting DiffusionBench in place of ImageNet alone: methods that improve DiffusionBench are more likely to reflect broader progress.
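
To make the correlation claim concrete, the following is a minimal sketch of how a Pearson correlation between per-method ImageNet and T2I scores could be computed. The function name `pearson`, the variable names, and the five score pairs are illustrative assumptions; the numbers are placeholders, not results from the paper, and the abstract does not say which three metrics were used or how NanoGen computes the statistic.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equal-length sequences with at least 2 values")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    # Sum of products of deviations (unnormalized covariance).
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    # Sums of squared deviations (unnormalized variances).
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(var_x * var_y)


# Hypothetical placeholder scores for five methods (NOT results from the paper).
# Each index is one method evaluated under both setups.
imagenet_scores = [2.1, 2.4, 2.9, 3.3, 3.8]
t2i_scores = [9.5, 8.7, 9.9, 8.1, 8.4]

r = pearson(imagenet_scores, t2i_scores)
print(f"Pearson r = {r:.3f}")
```

A value near +1 would mean that methods scoring well on ImageNet also score well on T2I; values that are weak or negative, as reported above, mean that ImageNet scores are a poor predictor of T2I scores.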