V-Co: A Closer Look at Visual Representation Alignment via Co-Denoising

Abstract

Pixel-space diffusion has recently re-emerged as a strong alternative to latent diffusion, enabling high-quality generation without pretrained autoencoders. However, standard pixel-space diffusion models receive relatively weak semantic supervision and are not explicitly designed to capture high-level visual structure. Recent representation-alignment methods (e.g., REPA) suggest that pretrained visual features can substantially improve diffusion training, and visual co-denoising has emerged as a promising way to incorporate such features into the generative process. Existing co-denoising approaches, however, often entangle multiple design choices, making it unclear which of them are truly essential. We therefore present V-Co, a systematic study of visual co-denoising within a unified JiT-based framework. This controlled setting allows us to isolate the ingredients that make visual co-denoising effective, and our study identifies four of them. First, preserving feature-specific computation while enabling flexible cross-stream interaction motivates a fully dual-stream architecture. Second, effective classifier-free guidance (CFG) requires a structurally defined unconditional prediction. Third, stronger semantic supervision is best provided by a perceptual-drifting hybrid loss. Fourth, stable co-denoising further requires proper cross-stream calibration, which we realize through RMS-based feature rescaling. Together, these findings yield a simple recipe for visual co-denoising. Experiments on ImageNet-256 show that, at comparable model sizes, V-Co outperforms both the underlying pixel-space diffusion baseline and strong prior pixel-diffusion methods while using fewer training epochs, offering practical guidance for future representation-aligned generative models.
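The abstract names RMS-based feature rescaling and classifier-free guidance without giving formulas. As a rough illustration only, the sketch below shows the generic textbook form of each idea: scaling one stream so that its root-mean-square magnitude matches another, and combining conditional and unconditional predictions with a guidance scale. The function names, the toy values, and the choice to match the feature stream to the pixel stream are all assumptions made for illustration. This is not V-Co's actual implementation, and it says nothing about how the paper defines its unconditional prediction.

```python
import math


def rms(values):
    """Root-mean-square of a non-empty sequence of floats."""
    return math.sqrt(sum(v * v for v in values) / len(values))


def rescale_to_rms(features, target_rms, eps=1e-8):
    """Scale `features` so its RMS is approximately `target_rms`.

    Generic RMS matching; `eps` guards against division by zero.
    """
    scale = target_rms / (rms(features) + eps)
    return [v * scale for v in features]


def cfg_combine(cond, uncond, guidance_scale):
    """One common CFG convention: uncond + w * (cond - uncond).

    The unconditional prediction is taken as given here.
    """
    return [u + guidance_scale * (c - u) for c, u in zip(cond, uncond)]


# Hypothetical toy values, chosen only to make the arithmetic easy to follow.
pixel_stream = [0.5, -0.5, 0.5, -0.5]    # RMS is 0.5
feature_stream = [3.0, -4.0, 0.0, 5.0]   # much larger magnitude

# Bring the feature stream to the same RMS scale as the pixel stream.
calibrated = rescale_to_rms(feature_stream, rms(pixel_stream))

# Guidance pushes the prediction away from the unconditional output.
guided = cfg_combine([1.0, 2.0], [0.0, 1.0], guidance_scale=2.0)
```

With a guidance scale of 1 the combination reduces to the conditional prediction, and larger scales extrapolate further away from the unconditional one. The rescaling changes only the magnitude of the feature stream and leaves its direction unchanged.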
