Representation Alignment for Just Image Transformers is not Easier than You Think

Abstract

Representation Alignment (REPA) has emerged as a simple way to accelerate Diffusion Transformer training in latent space. At the same time, pixel-space diffusion transformers such as Just image Transformers (JiT) have attracted growing attention because they remove the dependency on a pretrained tokenizer and thereby avoid the reconstruction bottleneck of latent diffusion. This paper shows that REPA can fail for JiT. REPA yields worse FID for JiT as training proceeds and collapses diversity on image subsets that are tightly clustered in the representation space of a pretrained semantic encoder on ImageNet. We trace the failure to an information asymmetry: denoising occurs in the high-dimensional image space, while the semantic target is strongly compressed, making direct regression a shortcut objective. We propose PixelREPA, which transforms the alignment target and constrains alignment with a Masked Transformer Adapter that combines a shallow transformer adapter with partial token masking. PixelREPA improves both training convergence and final quality. On ImageNet 256×256, PixelREPA reduces FID from 3.66 to 3.17 for JiT-B/16 and improves Inception Score (IS) from 275.1 to 284.6, while converging more than 2× faster. Finally, PixelREPA-H/16 achieves FID=1.81 and IS=317.2. Our code is available at https://github.com/kaist-cvml/PixelREPA.
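
To make the idea of constraining alignment with token masking concrete, the following is a toy sketch in pure Python, not the authors' implementation. The abstract does not specify how the alignment target is transformed, how deep the adapter is, which mask ratio is used, or which tokens are scored, so every such choice here is an assumption: masked tokens are zeroed, the "adapter" is a single residual self-attention layer with identity Q/K/V projections followed by a hypothetical linear projection `proj`, and the loss is 1 minus cosine similarity averaged over all tokens.

```python
import math
import random


def matmul(tokens, weight):
    """Multiply each token (row vector) by a d_in x d_out weight matrix."""
    cols = list(zip(*weight))
    return [[sum(x * w for x, w in zip(tok, col)) for col in cols] for tok in tokens]


def self_attention(tokens):
    """Single-head self-attention with identity Q/K/V projections (toy adapter layer)."""
    dim = len(tokens[0])
    mixed = []
    for query in tokens:
        scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(dim) for key in tokens]
        peak = max(scores)  # subtract the max for a numerically stable softmax
        weights = [math.exp(s - peak) for s in scores]
        total = sum(weights)
        mixed.append([sum(w * tok[j] for w, tok in zip(weights, tokens)) / total
                      for j in range(dim)])
    return mixed


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v + 1e-8)


def masked_alignment_loss(hidden, target, proj, mask_ratio, rng):
    """Mean (1 - cosine) between adapter outputs and target features.

    hidden: N x d_model tokens from the diffusion transformer (assumed shape).
    target: N x d_enc patch features from a frozen encoder (assumed shape).
    proj:   d_model x d_enc projection, hypothetical stand-in for a learned head.
    """
    num_tokens = len(hidden)
    num_masked = int(mask_ratio * num_tokens)
    masked_ids = set(rng.sample(range(num_tokens), num_masked))
    zero = [0.0] * len(hidden[0])
    # Assumption: masked tokens are replaced by zeros before the adapter.
    visible = [zero if i in masked_ids else tok for i, tok in enumerate(hidden)]
    # Residual attention lets masked positions borrow context from visible ones.
    attended = self_attention(visible)
    adapted = [[a + b for a, b in zip(v, m)] for v, m in zip(visible, attended)]
    predicted = matmul(adapted, proj)
    losses = [1.0 - cosine(p, t) for p, t in zip(predicted, target)]
    return sum(losses) / num_tokens, masked_ids


# Toy usage with random data: 8 tokens, d_model = 4, d_enc = 3.
rng = random.Random(0)
hidden = [[rng.gauss(0, 1) for _ in range(4)] for _ in range(8)]
target = [[rng.gauss(0, 1) for _ in range(3)] for _ in range(8)]
proj = [[rng.gauss(0, 1) for _ in range(3)] for _ in range(4)]
loss, masked_ids = masked_alignment_loss(hidden, target, proj, 0.5, rng)
```

In this sketch a masked position can only match its target through context gathered from the visible tokens, which is one plausible reading of how masking discourages the per-token shortcut regression described in the abstract. In practice the alignment term would be added to the denoising loss with a weighting coefficient, which is likewise not given in the abstract.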
