Coevolving Representations in Joint Image-Feature Diffusion

Abstract

Joint image-feature generative modeling has recently emerged as an effective strategy for improving diffusion training by coupling low-level VAE latents with high-level semantic features extracted from pre-trained visual encoders. However, existing approaches rely on a fixed representation space, constructed independently of the generative objective and kept unchanged during training. We argue that the representation space guiding diffusion should itself adapt to the generative task. To this end, we propose Coevolving Representation Diffusion (CoReDi), a framework in which the semantic representation space evolves during training by learning a lightweight linear projection jointly with the diffusion model. While naively optimizing this projection leads to degenerate solutions, we show that stable coevolution can be achieved through a combination of stop-gradient targets, normalization, and targeted regularization that prevents feature collapse. This formulation enables the semantic space to progressively specialize to the needs of image synthesis, improving its complementarity with image latents. We apply CoReDi to both VAE latent diffusion and pixel-space diffusion, demonstrating that adaptive semantic representations improve generative modeling across both settings. Experiments show that CoReDi achieves faster convergence and higher sample quality compared to joint diffusion models operating in fixed representation spaces.
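The abstract does not spell out the training objective, so the following is only a minimal NumPy sketch of why a naively learned projection can collapse and how normalization plus a variance-style regularizer might counteract it. Every name here (`project`, `normalize`, `variance_penalty`, `feature_loss`) and the VICReg-style hinge on the per-dimension standard deviation are illustrative assumptions, not the paper's formulation. NumPy has no automatic differentiation, so the stop-gradient appears only as a comment; in an autodiff framework the target would be wrapped in a call such as `jax.lax.stop_gradient`.

```python
import numpy as np

def project(features, W):
    """Lightweight linear projection of encoder features: (N, D) @ (D, d) -> (N, d)."""
    return features @ W

def normalize(z, eps=1e-6):
    """Standardize each projected dimension over the batch."""
    return (z - z.mean(axis=0)) / (z.std(axis=0) + eps)

def variance_penalty(z, gamma=1.0, eps=1e-6):
    """Hinge penalty (assumed, VICReg-style): positive when a dimension's std is below gamma."""
    std = np.sqrt(z.var(axis=0) + eps)
    return float(np.mean(np.maximum(0.0, gamma - std)))

def feature_loss(pred, target):
    """Mean squared error against the projected features.

    With autodiff, `target` would be treated as a constant (stop-gradient),
    so the projection cannot lower this term by shrinking its own output.
    """
    return float(np.mean((pred - target) ** 2))

rng = np.random.default_rng(0)
features = rng.normal(size=(256, 32))              # stand-in for frozen encoder features
W_collapsed = np.zeros((32, 8))                    # degenerate projection
W_spread = rng.normal(size=(32, 8)) / np.sqrt(32)  # non-degenerate projection

z_collapsed = project(features, W_collapsed)
z_spread = normalize(project(features, W_spread))

# A predictor that always outputs zeros matches the collapsed target exactly,
# so the naive feature loss alone cannot tell collapse from success.
naive_collapsed = feature_loss(np.zeros_like(z_collapsed), z_collapsed)

# The variance penalty separates the two cases.
pen_collapsed = variance_penalty(z_collapsed)
pen_spread = variance_penalty(z_spread)

print(naive_collapsed, pen_collapsed, pen_spread)
```

In this toy setting the collapsed projection drives the naive loss to zero while incurring a large variance penalty, whereas the normalized, well-spread projection incurs almost none. That is the kind of signal a collapse-preventing regularizer is meant to provide.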
