Boosting Generative Image Modeling via Joint Image-Feature Synthesis

Abstract

Latent diffusion models (LDMs) dominate high-quality image generation, yet integrating representation learning with generative modeling remains a challenge. We introduce a generative image modeling framework that bridges this gap by using a diffusion model to jointly model low-level image latents (from a variational autoencoder) and high-level semantic features (from a pretrained self-supervised encoder such as DINO). Our latent-semantic diffusion approach learns to generate coherent image-feature pairs from pure noise, significantly improving both generative quality and training efficiency while requiring only minimal modifications to standard Diffusion Transformer architectures. By eliminating the need for complex distillation objectives, our unified design simplifies training and enables a new inference strategy, Representation Guidance, which uses the learned semantics to steer and refine image generation. Evaluated in both conditional and unconditional settings, our method delivers substantial improvements in image quality and training convergence speed, establishing a new direction for representation-aware generative modeling.
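To make the idea of jointly modeling image latents and semantic features concrete, here is a minimal sketch in plain Python. It is an illustration under stated assumptions, not the authors' implementation. The abstract does not specify the following, so all of it is hypothetical: the channel-wise concatenation of the two token types, the DDPM-style noising rule, the noise-prediction loss, the `feature_weight` parameter, and every function name.

```python
import math
import random


def add_noise(x, noise, alpha_bar):
    """DDPM-style forward step: sqrt(alpha_bar) * x + sqrt(1 - alpha_bar) * noise."""
    a, b = math.sqrt(alpha_bar), math.sqrt(1.0 - alpha_bar)
    return [a * xi + b * ni for xi, ni in zip(x, noise)]


def joint_noising(latent, feature, alpha_bar, rng):
    """Noise a VAE latent token and a semantic feature token at one shared noise level."""
    # Assumption: the two representations are fused by channel-wise concatenation.
    clean = latent + feature
    noise = [rng.gauss(0.0, 1.0) for _ in clean]
    return add_noise(clean, noise, alpha_bar), noise


def joint_loss(pred_noise, true_noise, n_latent, feature_weight=1.0):
    """MSE on the latent channels plus a weighted MSE on the feature channels."""
    def mse(p, t):
        return sum((pi - ti) ** 2 for pi, ti in zip(p, t)) / len(p)

    loss_latent = mse(pred_noise[:n_latent], true_noise[:n_latent])
    loss_feature = mse(pred_noise[n_latent:], true_noise[n_latent:])
    return loss_latent + feature_weight * loss_feature


rng = random.Random(0)
latent = [0.5, -1.0, 0.25, 2.0]  # toy stand-in for one VAE latent token
feature = [1.0, 0.0, -0.5]       # toy stand-in for one (reduced) DINO feature token
noisy, noise = joint_noising(latent, feature, alpha_bar=0.5, rng=rng)

# A denoiser that predicted the noise perfectly would incur zero joint loss.
perfect = joint_loss(noise, noise, n_latent=len(latent))
```

In a real system the denoiser would be a Diffusion Transformer operating on many tokens per image. The sketch only shows that a single noise level and a single objective can cover both the image latents and the semantic features.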
