Representation Alignment for Just Image Transformers is not Easier than You Think

Abstract

Representation Alignment (REPA) has emerged as a simple way to accelerate the training of Diffusion Transformers in latent space. At the same time, pixel-space diffusion transformers such as Just image Transformers (JiT) have attracted growing attention because they remove the dependency on a pretrained tokenizer and thereby avoid the reconstruction bottleneck of latent diffusion. This paper shows that REPA can fail for JiT. REPA yields a worse FID for JiT as training proceeds and collapses diversity on image subsets that are tightly clustered in the representation space of a pretrained semantic encoder on ImageNet. We trace the failure to an information asymmetry: denoising occurs in the high-dimensional image space, while the semantic target is strongly compressed, which makes direct regression a shortcut objective. We propose PixelREPA, which transforms the alignment target and constrains alignment with a Masked Transformer Adapter that combines a shallow transformer adapter with partial token masking. PixelREPA improves both training convergence and final quality. On ImageNet 256×256, PixelREPA reduces FID from 3.66 to 3.17 for JiT-B/16 and improves Inception Score (IS) from 275.1 to 284.6, while converging more than 2× faster. Finally, PixelREPA-H/16 achieves FID=1.81 and IS=317.2. Our code is available at https://github.com/kaist-cvml/PixelREPA.
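
To make the two ingredients named in the abstract concrete, the toy sketch below pairs a REPA-style alignment loss (negative mean per-token cosine similarity) with partial token masking applied before an adapter. It is an illustration under stated assumptions, not the PixelREPA implementation: the abstract does not specify the masking scheme, the adapter architecture, or how the alignment target is transformed. The names `mask_tokens` and `toy_adapter`, the zero-vector mask, the 0.5 mask ratio, and the mean-mixing adapter are all hypothetical stand-ins, and random vectors replace real denoiser and encoder features.

```python
import math
import random


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm > 0 else 0.0


def alignment_loss(features, targets):
    """Negative mean per-token cosine similarity between features and targets."""
    sims = [cosine(f, t) for f, t in zip(features, targets)]
    return -sum(sims) / len(sims)


def mask_tokens(tokens, mask_ratio, rng):
    """Replace a random subset of tokens with zero vectors (assumed masking scheme)."""
    n_mask = int(len(tokens) * mask_ratio)
    masked = set(rng.sample(range(len(tokens)), n_mask))
    out = [[0.0] * len(t) if i in masked else list(t) for i, t in enumerate(tokens)]
    return out, masked


def toy_adapter(tokens):
    """Hypothetical stand-in for a shallow transformer adapter.

    Each output token is the average of the input token and the mean of all
    tokens, so masked positions can only be filled in from context.
    """
    dim = len(tokens[0])
    mean = [sum(t[d] for t in tokens) / len(tokens) for d in range(dim)]
    return [[0.5 * t[d] + 0.5 * mean[d] for d in range(dim)] for t in tokens]


rng = random.Random(0)
# Random placeholders for 16 denoiser tokens and 16 encoder target tokens (dim 8).
hidden = [[rng.gauss(0, 1) for _ in range(8)] for _ in range(16)]
targets = [[rng.gauss(0, 1) for _ in range(8)] for _ in range(16)]

masked_hidden, masked_idx = mask_tokens(hidden, mask_ratio=0.5, rng=rng)
loss = alignment_loss(toy_adapter(masked_hidden), targets)
print(f"masked {len(masked_idx)} of {len(hidden)} tokens, alignment loss = {loss:.4f}")
```

In this sketch the adapter sees only the unmasked tokens, so the alignment signal at masked positions must come from context rather than from a direct per-token regression. That is one plausible reading of how masking could constrain the shortcut objective described in the abstract; the released code should be consulted for the actual design.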
