Efficient Generative Model Training via Embedded Representation Warmup

Abstract

Diffusion models excel at generating high-dimensional data but fall short of self-supervised methods in training efficiency and representation quality. We identify a key bottleneck: high-quality, semantically rich representations are underutilized during training, which notably slows convergence. Our systematic analysis reveals a critical representation processing region, located primarily in the early layers, where semantic and structural patterns are learned before generation can occur. To address this, we propose Embedded Representation Warmup (ERW), a plug-and-play framework whose first stage uses the ERW module as a warmup that initializes the early layers of the diffusion model with high-quality, pretrained representations. This warmup minimizes the burden of learning representations from scratch, thereby accelerating convergence and boosting performance. Our theoretical analysis shows that ERW is most effective when it is integrated precisely into specific neural network layers (the representation processing region), where the model primarily processes and transforms feature representations for later generation. We further establish that ERW not only accelerates training convergence but also enhances representation quality: empirically, our method achieves a 40× acceleration in training speed compared to REPA, the current state-of-the-art method. Code is available at https://github.com/LINs-lab/ERW.
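The warm-start idea described above can be illustrated with a minimal sketch. This is not the authors' implementation: it assumes, purely for illustration, that "initializing the early layers" means copying pretrained weights into the first few layers of a toy layer stack, and the names make_layers, warm_start_early_layers, and num_warm_layers are hypothetical. The actual ERW procedure may differ; see the linked repository.

```python
import copy
import random


def make_layers(num_layers, width, seed):
    """Build a toy stack of layers; each layer is a width x width weight matrix."""
    rng = random.Random(seed)
    return [
        [[rng.gauss(0.0, 0.02) for _ in range(width)] for _ in range(width)]
        for _ in range(num_layers)
    ]


def warm_start_early_layers(model_layers, pretrained_layers, num_warm_layers):
    """Return a copy of model_layers whose first num_warm_layers layers
    are replaced by the corresponding pretrained layers.

    Assumption (not from the paper): warmup is modeled as a plain weight copy.
    """
    if num_warm_layers > min(len(model_layers), len(pretrained_layers)):
        raise ValueError("num_warm_layers exceeds the available layers")
    warmed = copy.deepcopy(model_layers)
    for i in range(num_warm_layers):
        # Layers must have matching shapes to be copied over.
        if len(pretrained_layers[i]) != len(model_layers[i]):
            raise ValueError(f"shape mismatch at layer {i}")
        warmed[i] = copy.deepcopy(pretrained_layers[i])
    return warmed


# Toy "diffusion model" with 8 layers and a toy "pretrained encoder" with 3.
diffusion_layers = make_layers(num_layers=8, width=4, seed=0)
pretrained_layers = make_layers(num_layers=3, width=4, seed=1)

# Only the early layers (the assumed representation processing region)
# are warm-started; later layers keep their random initialization.
warmed = warm_start_early_layers(diffusion_layers, pretrained_layers, num_warm_layers=3)
print(f"warm-started 3 of {len(warmed)} layers")
```

In this sketch, the later layers stay at their original initialization, mirroring the abstract's claim that the warmup targets the early layers rather than the whole network.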
