V-Co: A Closer Look at Visual Representation Alignment via Co-Denoising

Abstract

Pixel-space diffusion has recently re-emerged as a strong alternative to latent diffusion, enabling high-quality generation without pretrained autoencoders. However, standard pixel-space diffusion models receive relatively weak semantic supervision and are not explicitly designed to capture high-level visual structure. Recent representation-alignment methods (e.g., REPA) suggest that pretrained visual features can substantially improve diffusion training, and visual co-denoising has emerged as a promising direction for incorporating such features into the generative process. However, existing co-denoising approaches often entangle multiple design choices, making it unclear which ones are truly essential. Therefore, we present V-Co, a systematic study of visual co-denoising in a unified JiT-based framework. This controlled setting allows us to isolate the ingredients that make visual co-denoising effective. Our study reveals four key ingredients. First, preserving feature-specific computation while enabling flexible cross-stream interaction motivates a fully dual-stream architecture. Second, effective classifier-free guidance (CFG) requires a structurally defined unconditional prediction. Third, stronger semantic supervision is best provided by a perceptual-drifting hybrid loss. Fourth, stable co-denoising further requires proper cross-stream calibration, which we realize through RMS-based feature rescaling. Together, these findings yield a simple recipe for visual co-denoising. Experiments on ImageNet-256 show that, at comparable model sizes, V-Co outperforms the underlying pixel-space diffusion baseline and strong prior pixel-diffusion methods while using fewer training epochs, offering practical guidance for future representation-aligned generative models.
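To make two of the ingredients named in the abstract more concrete, the following is a minimal illustrative sketch, not the V-Co implementation. The abstract does not say how the RMS-based rescaling is computed or how the unconditional prediction is constructed. The sketch therefore assumes the simplest reading: a feature vector is scaled so that its root-mean-square matches a target value, and guidance uses the standard classifier-free combination of a conditional and an unconditional prediction. The function names `rms`, `rescale_to_rms`, and `cfg_combine`, and the `eps` parameter, are hypothetical.

```python
import math


def rms(values):
    """Root-mean-square of a flat list of floats."""
    return math.sqrt(sum(v * v for v in values) / len(values))


def rescale_to_rms(features, target_rms, eps=1e-8):
    """Scale `features` so that their RMS is (approximately) `target_rms`.

    Assumption: this is one plausible form of "RMS-based feature
    rescaling"; the source does not give the exact formula.
    """
    scale = target_rms / (rms(features) + eps)
    return [v * scale for v in features]


def cfg_combine(cond, uncond, guidance_scale):
    """Standard classifier-free guidance: uncond + w * (cond - uncond).

    How the unconditional prediction is "structurally defined" in V-Co
    is not modeled here; `uncond` is simply taken as given.
    """
    return [u + guidance_scale * (c - u) for c, u in zip(cond, uncond)]


if __name__ == "__main__":
    # Hypothetical feature-stream values with a much larger scale
    # than a pixel stream whose RMS is assumed to be 1.0.
    feature_stream = [30.0, -40.0, 10.0, 20.0]
    calibrated = rescale_to_rms(feature_stream, target_rms=1.0)
    print("RMS before:", rms(feature_stream))
    print("RMS after: ", rms(calibrated))

    # With guidance_scale = 1.0 the result equals the conditional prediction.
    print(cfg_combine([2.0, 0.5], [1.0, 0.0], guidance_scale=1.0))
```

Under these assumptions, rescaling keeps the direction of the feature vector and changes only its magnitude, which is one simple way to put two streams on a comparable scale before they interact.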
