

V-Co: A Closer Look at Visual Representation Alignment via Co-Denoising

March 17, 2026
作者: Han Lin, Xichen Pan, Zun Wang, Yue Zhang, Chu Wang, Jaemin Cho, Mohit Bansal
cs.AI

Abstract
Pixel-space diffusion has recently re-emerged as a strong alternative to latent diffusion, enabling high-quality generation without pretrained autoencoders. However, standard pixel-space diffusion models receive relatively weak semantic supervision and are not explicitly designed to capture high-level visual structure. Recent representation-alignment methods (e.g., REPA) suggest that pretrained visual features can substantially improve diffusion training, and visual co-denoising has emerged as a promising direction for incorporating such features into the generative process. However, existing co-denoising approaches often entangle multiple design choices, making it unclear which are truly essential. Therefore, we present V-Co, a systematic study of visual co-denoising in a unified JiT-based framework. This controlled setting allows us to isolate the ingredients that make visual co-denoising effective. Our study reveals four key ingredients. First, preserving feature-specific computation while enabling flexible cross-stream interaction motivates a fully dual-stream architecture. Second, effective classifier-free guidance (CFG) requires a structurally defined unconditional prediction. Third, stronger semantic supervision is best provided by a perceptual-drifting hybrid loss. Fourth, stable co-denoising further requires proper cross-stream calibration, which we realize through RMS-based feature rescaling. Together, these findings yield a simple recipe for visual co-denoising. Experiments on ImageNet-256 show that, at comparable model sizes, V-Co outperforms the underlying pixel-space diffusion baseline and strong prior pixel-diffusion methods while using fewer training epochs, offering practical guidance for future representation-aligned generative models.
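For context on the second ingredient: the abstract does not spell out how V-Co constructs its "structurally defined" unconditional prediction, but standard classifier-free guidance, which it builds on, combines conditional and unconditional denoiser outputs as sketched below (all names here are illustrative, not the paper's API):

```python
import numpy as np

def cfg_combine(eps_cond, eps_uncond, guidance_scale):
    """Standard classifier-free guidance: extrapolate the conditional
    prediction away from the unconditional one by guidance_scale."""
    return eps_uncond + guidance_scale * (eps_cond - eps_uncond)

# Toy predictions from the same denoiser run with and without the
# class condition (values are arbitrary for illustration).
eps_c = np.array([0.8, -0.2, 0.1])
eps_u = np.array([0.5, 0.0, 0.0])
guided = cfg_combine(eps_c, eps_u, guidance_scale=2.0)
# guided = eps_u + 2.0 * (eps_c - eps_u) = [1.1, -0.4, 0.2]
```

The paper's claim is that in a co-denoising setup this recipe only works well when the unconditional branch is defined structurally (i.e., by the architecture) rather than ad hoc; the formula itself is unchanged.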
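For the fourth ingredient, the abstract only names "RMS-based feature rescaling" without details. A minimal sketch of one plausible form, rescaling the semantic-feature stream to match the pixel stream's root-mean-square activation so neither stream dominates cross-stream interaction, could look like this (function and variable names are hypothetical, not from the paper):

```python
import numpy as np

def rms_rescale(features, target_rms, eps=1e-6):
    """Rescale a feature tensor so its root-mean-square activation
    matches target_rms. A toy calibration; the paper's exact
    construction may differ."""
    rms = np.sqrt(np.mean(np.square(features)) + eps)
    return features * (target_rms / rms)

# Toy activations: a pixel stream and a much larger-magnitude
# visual-encoder stream, calibrated to a common RMS before mixing.
pixel_feats = np.random.randn(4, 256) * 0.5
semantic_feats = np.random.randn(4, 256) * 8.0

target = np.sqrt(np.mean(np.square(pixel_feats)))
calibrated = rms_rescale(semantic_feats, target_rms=target)
```

After rescaling, both streams contribute at comparable magnitudes, which is the stability property the abstract attributes to cross-stream calibration.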
PDF, March 19, 2026