Representation Alignment for Just Image Transformers is not Easier than You Think

Abstract

Representation Alignment (REPA) has emerged as a simple way to accelerate the training of Diffusion Transformers in latent space. At the same time, pixel-space diffusion transformers such as Just image Transformers (JiT) have attracted growing attention because they remove the dependency on a pretrained tokenizer and thereby avoid the reconstruction bottleneck of latent diffusion. This paper shows that REPA can fail for JiT: as training proceeds, REPA yields worse FID and collapses diversity on image subsets that are tightly clustered in the representation space of a pretrained semantic encoder on ImageNet. We trace the failure to an information asymmetry: denoising occurs in the high-dimensional image space, while the semantic target is strongly compressed, which makes direct regression a shortcut objective. We propose PixelREPA, which transforms the alignment target and constrains alignment with a Masked Transformer Adapter that combines a shallow transformer adapter with partial token masking. PixelREPA improves both training convergence and final quality. On ImageNet 256×256, it reduces FID from 3.66 to 3.17 for JiT-B/16 and improves Inception Score (IS) from 275.1 to 284.6, while converging more than 2× faster. Finally, PixelREPA-H/16 achieves FID = 1.81 and IS = 317.2. Our code is available at https://github.com/kaist-cvml/PixelREPA.
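To make the idea of constraining alignment with partial token masking more concrete, here is a minimal pure-Python sketch. It is not the authors' implementation. The names `sample_token_mask`, `masked_alignment_loss`, and `identity_adapter` are hypothetical, the zero vector standing in for a mask token is an assumption, and so is the choice to average a cosine-similarity loss over all tokens. The adapter is passed in as a plain callable, whereas the abstract describes a shallow transformer adapter in that role.

```python
import math
import random


def cosine(u, v, eps=1e-8):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v + eps)


def sample_token_mask(num_tokens, mask_ratio, rng):
    """Choose int(num_tokens * mask_ratio) positions to hide (True = masked)."""
    num_masked = int(num_tokens * mask_ratio)
    masked = set(rng.sample(range(num_tokens), num_masked))
    return [i in masked for i in range(num_tokens)]


def masked_alignment_loss(hidden, target, adapter, mask):
    """Negative mean cosine similarity between adapter outputs and targets.

    hidden:  per-token features from the diffusion model (list of vectors)
    target:  per-token features from a frozen semantic encoder
    adapter: hypothetical callable mapping a token list to a token list
    mask:    booleans; masked tokens are replaced before the adapter sees them
    """
    dim = len(hidden[0])
    # Assumption: a zero vector stands in for a learned mask token.
    visible = [[0.0] * dim if m else list(h) for h, m in zip(hidden, mask)]
    predicted = adapter(visible)
    # Assumption: the loss is averaged over every token, masked or not.
    sims = [cosine(p, t) for p, t in zip(predicted, target)]
    return -sum(sims) / len(sims)


def identity_adapter(tokens):
    """Placeholder adapter; a real one would mix information across tokens."""
    return tokens


rng = random.Random(0)
hidden = [[rng.gauss(0, 1) for _ in range(4)] for _ in range(8)]
target = [[rng.gauss(0, 1) for _ in range(4)] for _ in range(8)]
mask = sample_token_mask(len(hidden), 0.5, rng)
loss = masked_alignment_loss(hidden, target, identity_adapter, mask)
```

With the identity placeholder, masked positions contribute zero similarity, so the adapter cannot match the target by copying each token one-to-one. An adapter that attends across tokens would have to infer the hidden positions from context, which is one plausible reading of how masking discourages the shortcut objective described above.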
