Adapting Self-Supervised Representations as a Latent Space for Efficient Generation
October 16, 2025
Authors: Ming Gui, Johannes Schusterbauer, Timy Phan, Felix Krause, Josh Susskind, Miguel Angel Bautista, Björn Ommer
cs.AI
Abstract
We introduce Representation Tokenizer (RepTok), a generative modeling
framework that represents an image using a single continuous latent token
obtained from self-supervised vision transformers. Building on a pre-trained
SSL encoder, we fine-tune only the semantic token embedding and pair it with a
generative decoder trained jointly using a standard flow matching objective.
This adaptation enriches the token with low-level, reconstruction-relevant
details, enabling faithful image reconstruction. To preserve the favorable
geometry of the original SSL space, we add a cosine-similarity loss that
regularizes the adapted token, ensuring the latent space remains smooth and
suitable for generation. Our single-token formulation resolves spatial
redundancies of 2D latent spaces and significantly reduces training costs.
Despite its simplicity and efficiency, RepTok achieves competitive results on
class-conditional ImageNet generation and naturally extends to text-to-image
synthesis, reaching competitive zero-shot performance on MS-COCO under
extremely limited training budgets. Our findings highlight the potential of
fine-tuned SSL representations as compact and effective latent spaces for
efficient generative modeling.
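
To make the training objective concrete, below is a minimal PyTorch sketch of one training step combining the standard flow matching loss with the cosine-similarity regularizer described in the abstract. The module names (`encoder`, `frozen_encoder`, `decoder`), the decoder's call signature, and the weighting `lambda_cos` are illustrative assumptions, not the paper's actual implementation.

```python
import torch
import torch.nn.functional as F

def reptok_training_loss(image, encoder, frozen_encoder, decoder, lambda_cos=0.1):
    """One hedged RepTok-style training step (names and weights are assumptions).

    encoder        -- fine-tuned SSL encoder returning a single latent token z
    frozen_encoder -- original (frozen) SSL encoder, anchor for the cosine loss
    decoder        -- flow-matching velocity network conditioned on z and t
    """
    # Single continuous latent token from the fine-tuned SSL encoder.
    z = encoder(image)                      # shape (B, D)
    with torch.no_grad():
        z_ssl = frozen_encoder(image)       # shape (B, D), frozen reference

    # Standard (rectified) flow matching: interpolate noise -> image.
    x1 = image
    x0 = torch.randn_like(x1)
    t = torch.rand(x1.size(0), device=x1.device).view(-1, 1, 1, 1)
    xt = (1.0 - t) * x0 + t * x1
    v_target = x1 - x0

    # Decoder predicts the velocity field, conditioned on the latent token.
    v_pred = decoder(xt, t.flatten(), z)
    loss_fm = F.mse_loss(v_pred, v_target)

    # Cosine-similarity loss keeps z close to the original SSL geometry,
    # so the latent space stays smooth and suitable for generation.
    loss_cos = 1.0 - F.cosine_similarity(z, z_ssl, dim=-1).mean()

    return loss_fm + lambda_cos * loss_cos
```

Because the latent is a single token rather than a 2D grid, the generative prior over z is far cheaper to train than one over spatial latents, which is the source of the training-cost savings the abstract claims.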