Adapting Self-Supervised Representations as a Latent Space for Efficient Generation

Abstract

We introduce Representation Tokenizer (RepTok), a generative modeling framework that represents an image using a single continuous latent token obtained from self-supervised vision transformers. Building on a pre-trained SSL encoder, we fine-tune only the semantic token embedding and pair it with a generative decoder trained jointly using a standard flow matching objective. This adaptation enriches the token with low-level, reconstruction-relevant details, enabling faithful image reconstruction. To preserve the favorable geometry of the original SSL space, we add a cosine-similarity loss that regularizes the adapted token, ensuring the latent space remains smooth and suitable for generation. Our single-token formulation resolves spatial redundancies of 2D latent spaces and significantly reduces training costs. Despite its simplicity and efficiency, RepTok achieves competitive results on class-conditional ImageNet generation and naturally extends to text-to-image synthesis, reaching competitive zero-shot performance on MS-COCO under extremely limited training budgets. Our findings highlight the potential of fine-tuned SSL representations as compact and effective latent spaces for efficient generative modeling.
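The training signal described above combines two terms: a flow matching loss for the decoder and a cosine-similarity penalty that keeps the adapted token close to the original SSL token. The following is a minimal sketch of how such a combined objective could look, not the authors' implementation. It assumes a linear noise-to-image interpolation path for flow matching, a penalty of the form 1 - cos, and a weighting factor; the names `predict_velocity`, `reg_weight`, `adapted_token`, and `frozen_token` are hypothetical and do not come from the paper.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def cosine_regularizer(adapted_token, frozen_token):
    """Assumed penalty form: 0 when both tokens point the same way, 2 when opposite."""
    return 1.0 - cosine_similarity(adapted_token, frozen_token)


def flow_matching_loss(predict_velocity, image, noise, t, token):
    """Mean squared error between predicted and target velocity.

    Assumes a linear path x_t = (1 - t) * noise + t * image, whose
    velocity (image - noise) is constant in t.
    """
    x_t = [(1.0 - t) * n + t * x for n, x in zip(noise, image)]
    target = [x - n for n, x in zip(noise, image)]
    pred = predict_velocity(x_t, t, token)  # stand-in for the generative decoder
    return sum((p - g) ** 2 for p, g in zip(pred, target)) / len(target)


def total_loss(predict_velocity, image, noise, t,
               adapted_token, frozen_token, reg_weight=1.0):
    """Flow matching term plus weighted cosine regularizer (reg_weight is hypothetical)."""
    fm = flow_matching_loss(predict_velocity, image, noise, t, adapted_token)
    reg = cosine_regularizer(adapted_token, frozen_token)
    return fm + reg_weight * reg
```

In this sketch the decoder is conditioned on the adapted token only, so gradients from the flow matching term would push reconstruction-relevant detail into the token, while the regularizer limits how far its direction drifts from the frozen SSL embedding.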

