Representation Alignment for Just Image Transformers is not Easier than You Think
March 15, 2026
Authors: Jaeyo Shin, Jiwook Kim, Hyunjung Shim
cs.AI
Abstract
Representation Alignment (REPA) has emerged as a simple way to accelerate the training of Diffusion Transformers in latent space. At the same time, pixel-space diffusion transformers such as Just image Transformers (JiT) have attracted growing attention because they remove the dependency on a pretrained tokenizer and thereby avoid the reconstruction bottleneck of latent diffusion. This paper shows that REPA can fail for JiT: REPA yields worse FID for JiT as training proceeds and collapses diversity on image subsets that are tightly clustered in the representation space of an ImageNet-pretrained semantic encoder. We trace the failure to an information asymmetry: denoising occurs in the high-dimensional image space, while the semantic target is strongly compressed, making direct regression a shortcut objective. We propose PixelREPA, which transforms the alignment target and constrains alignment with a Masked Transformer Adapter that combines a shallow transformer adapter with partial token masking. PixelREPA improves both training convergence and final quality: on ImageNet 256×256, it reduces FID from 3.66 to 3.17 for JiT-B/16 and improves Inception Score (IS) from 275.1 to 284.6, while achieving more than 2× faster convergence. Finally, PixelREPA-H/16 achieves FID = 1.81 and IS = 317.2. Our code is available at https://github.com/kaist-cvml/PixelREPA.
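To make the alignment objective concrete, below is a minimal, hypothetical sketch of a REPA-style cosine-regression loss with partial token masking, in the spirit of the Masked Transformer Adapter described in the abstract. The function name, the uniform-random masking scheme, and the NumPy formulation are illustrative assumptions, not the paper's implementation (which applies the loss to diffusion features after a shallow transformer adapter).

```python
import numpy as np

def masked_alignment_loss(h, z, mask_ratio=0.5, rng=None):
    """Hypothetical REPA-style loss: regress denoiser token features h
    (N, D) onto semantic-encoder targets z (N, D) via cosine distance,
    computed only on a random subset of tokens (partial token masking).
    Returns 0 for perfectly aligned tokens, up to 2 for anti-aligned."""
    rng = np.random.default_rng(0) if rng is None else rng
    keep = rng.random(h.shape[0]) >= mask_ratio  # tokens kept for alignment
    if not keep.any():                           # guard: keep at least one token
        keep[:] = True
    h_k, z_k = h[keep], z[keep]
    # per-token cosine similarity between features and targets
    num = (h_k * z_k).sum(axis=1)
    den = np.linalg.norm(h_k, axis=1) * np.linalg.norm(z_k, axis=1) + 1e-8
    return float((1.0 - num / den).mean())
```

Masking means only a fraction of tokens receive the compressed semantic target each step, which is one way to weaken the direct-regression shortcut the abstract identifies.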