PixArt-α: Fast Training of Diffusion Transformer for Photorealistic Text-to-Image Synthesis

Abstract

The most advanced text-to-image (T2I) models require significant training costs (e.g., millions of GPU hours), seriously hindering fundamental innovation in the AIGC community while increasing CO2 emissions. This paper introduces PixArt-α, a Transformer-based T2I diffusion model whose image generation quality is competitive with state-of-the-art image generators (e.g., Imagen, SDXL, and even Midjourney), approaching commercial application standards. Additionally, it supports high-resolution image synthesis up to 1024px with low training cost, as shown in Figures 1 and 2. To achieve this goal, three core designs are proposed: (1) Training strategy decomposition: We devise three distinct training steps that separately optimize pixel dependency, text-image alignment, and image aesthetic quality; (2) Efficient T2I Transformer: We incorporate cross-attention modules into the Diffusion Transformer (DiT) to inject text conditions and streamline the computation-intensive class-condition branch; (3) Highly informative data: We emphasize the significance of concept density in text-image pairs and leverage a large Vision-Language model to auto-label dense pseudo-captions to assist text-image alignment learning. As a result, PixArt-α's training speed markedly surpasses that of existing large-scale T2I models; e.g., PixArt-α takes only 10.8% of Stable Diffusion v1.5's training time (675 vs. 6,250 A100 GPU days), saving nearly $300,000 ($26,000 vs. $320,000) and reducing CO2 emissions by 90%. Moreover, compared with a larger SOTA model, RAPHAEL, our training cost is merely 1%. Extensive experiments demonstrate that PixArt-α excels in image quality, artistry, and semantic control. We hope PixArt-α will provide new insights to the AIGC community and startups to accelerate building their own high-quality yet low-cost generative models from scratch.
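
The cost comparison quoted in the abstract can be sanity-checked with a few lines of arithmetic. The sketch below takes the abstract's figures at face value (675 vs. 6,250 A100 GPU days, $26,000 vs. $320,000). The constant and function names are illustrative assumptions, not names from the paper.

```python
# Figures as quoted in the abstract; they are not independently verified here.
PIXART_GPU_DAYS = 675       # A100 GPU days reported for PixArt-α
SD15_GPU_DAYS = 6_250       # A100 GPU days reported for Stable Diffusion v1.5
PIXART_COST_USD = 26_000    # training cost reported for PixArt-α
SD15_COST_USD = 320_000     # training cost reported for Stable Diffusion v1.5


def ratio(part: float, whole: float) -> float:
    """Return part / whole as a plain fraction (hypothetical helper)."""
    return part / whole


# Share of SD v1.5's training time that PixArt-α needs.
time_ratio = ratio(PIXART_GPU_DAYS, SD15_GPU_DAYS)

# Absolute saving in dollars between the two quoted costs.
savings_usd = SD15_COST_USD - PIXART_COST_USD

print(f"training time ratio: {time_ratio:.1%}")
print(f"cost savings: ${savings_usd:,}")
```

With these inputs the time ratio comes out at 10.8% and the saving at $294,000, which is consistent with the abstract's "10.8%" and "nearly $300,000". The 90% CO2 reduction and the 1% comparison with RAPHAEL depend on figures the abstract does not give, so they are not checked here.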
