D^2iT: Dynamic Diffusion Transformer for Accurate Image Generation

Abstract

Diffusion models are widely recognized for their ability to generate high-fidelity images. Despite the excellent performance and scalability of the Diffusion Transformer (DiT) architecture, it applies fixed compression across different image regions during the diffusion process, disregarding the naturally varying information densities of these regions. Large compression limits local realism, while small compression increases computational complexity and compromises global consistency, ultimately degrading the quality of generated images. To address these limitations, we propose dynamically compressing different image regions according to their importance, and introduce a novel two-stage framework designed to enhance the effectiveness and efficiency of image generation: (1) The Dynamic VAE (DVAE) in the first stage employs a hierarchical encoder to encode different image regions at different downsampling rates, tailored to their specific information densities, thereby providing more accurate and natural latent codes for the diffusion process. (2) The Dynamic Diffusion Transformer (D^2iT) in the second stage generates images by predicting multi-grained noise, consisting of coarse-grained noise (fewer latent codes in smooth regions) and fine-grained noise (more latent codes in detailed regions), through a novel combination of the Dynamic Grain Transformer and the Dynamic Content Transformer. This strategy of combining a rough prediction of noise with correction of detailed regions unifies global consistency and local realism. Comprehensive experiments on various generation tasks validate the effectiveness of our approach. Code will be released at https://github.com/jiawn-creator/Dynamic-DiT.
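To make the idea of region-dependent compression concrete, the following is a minimal sketch of how a grain map might be derived from an image. It is an illustration, not the released implementation: the abstract does not say how information density is measured, so the use of histogram entropy, the 16-pixel region size, the 1-versus-4 code counts, and the names `patch_entropy`, `grain_map`, and `num_latent_codes` are all assumptions made for this example.

```python
import numpy as np


def patch_entropy(patch: np.ndarray, bins: int = 32) -> float:
    """Shannon entropy (in bits) of one patch's intensity histogram.

    Assumption: entropy stands in for "information density"; the abstract
    does not specify the actual measure.
    """
    hist, _ = np.histogram(patch, bins=bins, range=(0.0, 1.0))
    p = hist[hist > 0] / hist.sum()
    return float(-(p * np.log2(p)).sum())


def grain_map(image: np.ndarray, patch: int = 16, fine_ratio: float = 0.5) -> np.ndarray:
    """Boolean grid over regions: True marks a fine-grained region.

    `image` is a 2-D grayscale array with values in [0, 1]. The `fine_ratio`
    fraction of regions with the highest entropy is marked fine; the rest
    stay coarse. Both parameters are hypothetical.
    """
    h, w = image.shape
    gh, gw = h // patch, w // patch
    ent = np.empty((gh, gw))
    for i in range(gh):
        for j in range(gw):
            region = image[i * patch:(i + 1) * patch, j * patch:(j + 1) * patch]
            ent[i, j] = patch_entropy(region)
    k = int(round(fine_ratio * gh * gw))
    fine = np.zeros(gh * gw, dtype=bool)
    if k > 0:
        # Indices of the k highest-entropy regions.
        fine[np.argsort(ent.ravel())[::-1][:k]] = True
    return fine.reshape(gh, gw)


def num_latent_codes(fine: np.ndarray) -> int:
    """Count latent codes, assuming 1 per coarse region and 4 (2x2) per fine region."""
    return int((~fine).sum() + 4 * fine.sum())
```

Under these assumptions, an image whose detail is concentrated in half of its regions needs fewer latent codes than a uniformly fine grid, while the detailed regions keep the higher resolution.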
