Boosting Generative Image Modeling via Joint Image-Feature Synthesis

Abstract

Latent diffusion models (LDMs) dominate high-quality image generation, yet integrating representation learning with generative modeling remains a challenge. We introduce a generative image modeling framework that bridges this gap by using a diffusion model to jointly model low-level image latents (from a variational autoencoder) and high-level semantic features (from a pretrained self-supervised encoder such as DINO). This latent-semantic diffusion approach learns to generate coherent image-feature pairs from pure noise, which significantly improves both generative quality and training efficiency while requiring only minimal modifications to standard Diffusion Transformer architectures. Because it needs no complex distillation objectives, the unified design simplifies training and enables a new inference strategy, Representation Guidance, which uses the learned semantics to steer and refine image generation. Evaluated in both conditional and unconditional settings, the method delivers substantial improvements in image quality and training convergence speed, establishing a new direction for representation-aware generative modeling.
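The abstract describes the joint modeling only at a high level, so the following is a minimal sketch of one way the idea could look, not the authors' implementation. It assumes that VAE latents and semantic features are available as per-token vectors, that the two are concatenated per token, and that both are noised with a single shared timestep under a simple linear schedule. The names `joint_forward_noise`, `joint_loss`, and `feature_weight`, as well as the token counts and channel widths, are hypothetical and chosen only for illustration.

```python
import random


def joint_forward_noise(latents, features, t, rng):
    """Noise image latents and semantic features with one shared timestep.

    latents:  list of N tokens, each a list of C_z floats (VAE latents)
    features: list of N tokens, each a list of C_f floats (e.g. DINO features)
    t:        float in [0, 1]; 0 keeps the data clean, 1 gives pure noise
    """
    # Concatenate the two modalities per token: N tokens of width C_z + C_f.
    joint = [z + f for z, f in zip(latents, features)]
    noise = [[rng.gauss(0.0, 1.0) for _ in tok] for tok in joint]
    # Linear interpolation schedule, an assumption made only for simplicity.
    noisy = [[(1.0 - t) * x + t * e for x, e in zip(tok, eps)]
             for tok, eps in zip(joint, noise)]
    return noisy, noise


def joint_loss(pred, target, c_z, feature_weight=1.0):
    """Mean squared error over both parts; feature_weight is a hypothetical knob."""
    def mse(lo, hi):
        errs = [(p - q) ** 2
                for ptok, qtok in zip(pred, target)
                for p, q in zip(ptok[lo:hi], qtok[lo:hi])]
        return sum(errs) / len(errs)

    width = len(pred[0])
    # First c_z channels are image latents, the rest are semantic features.
    return mse(0, c_z) + feature_weight * mse(c_z, width)


rng = random.Random(0)
latents = [[rng.gauss(0.0, 1.0) for _ in range(4)] for _ in range(16)]
features = [[rng.gauss(0.0, 1.0) for _ in range(8)] for _ in range(16)]
noisy, noise = joint_forward_noise(latents, features, t=0.5, rng=rng)
```

In this sketch a denoising network would receive `noisy` and be trained so that its prediction minimizes `joint_loss` against `noise`, so a single model learns image latents and semantic features together. How the actual method weights the two parts or injects the features into the Diffusion Transformer is not stated in the abstract.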
