

Adapting Self-Supervised Representations as a Latent Space for Efficient Generation

October 16, 2025
Authors: Ming Gui, Johannes Schusterbauer, Timy Phan, Felix Krause, Josh Susskind, Miguel Angel Bautista, Björn Ommer
cs.AI

Abstract

We introduce the Representation Tokenizer (RepTok), a generative modeling framework that represents an image with a single continuous latent token obtained from self-supervised vision transformers. Building on a pre-trained self-supervised learning (SSL) encoder, we fine-tune only the semantic token embedding and pair it with a generative decoder trained jointly under a standard flow matching objective. This adaptation enriches the token with low-level, reconstruction-relevant details, enabling faithful image reconstruction. To preserve the favorable geometry of the original SSL space, we add a cosine-similarity loss that regularizes the adapted token, ensuring the latent space remains smooth and suitable for generation. Our single-token formulation removes the spatial redundancy of 2D latent spaces and significantly reduces training costs. Despite its simplicity and efficiency, RepTok achieves competitive results on class-conditional ImageNet generation and extends naturally to text-to-image synthesis, reaching competitive zero-shot performance on MS-COCO under extremely limited training budgets. Our findings highlight the potential of fine-tuned SSL representations as compact and effective latent spaces for efficient generative modeling.
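The abstract names two training signals: a standard flow matching objective for the decoder and a cosine-similarity regularizer that keeps the adapted token aligned with the frozen SSL embedding. The sketch below illustrates how such a combined loss could be wired up in PyTorch; it is a minimal illustration under assumed names (`ssl_encoder`, `adapter`, `decoder`, `lambda_cos`) and an assumed rectified-flow interpolation, not the authors' implementation.

```python
# Hypothetical sketch of a RepTok-style training step (assumptions, not the paper's code).
import torch
import torch.nn.functional as F

def training_step(x, ssl_encoder, adapter, decoder, lambda_cos=0.1):
    """One training step combining flow matching with a cosine regularizer.

    x:           batch of images, shape (B, C, H, W)
    ssl_encoder: frozen pre-trained SSL vision transformer returning a semantic token (B, D)
    adapter:     fine-tuned embedding head producing the adapted single latent token
    decoder:     generative decoder v(x_t, t, z) trained with flow matching
    lambda_cos:  weight of the cosine-similarity regularizer (illustrative value)
    """
    with torch.no_grad():
        z_ssl = ssl_encoder(x)             # original semantic token, kept frozen
    z = adapter(z_ssl)                      # adapted single continuous latent token

    # Flow matching (rectified-flow-style interpolation assumed): draw a noise
    # sample, interpolate toward the data, and regress the decoder's velocity
    # prediction onto the straight-line target.
    x1 = x
    x0 = torch.randn_like(x1)
    t = torch.rand(x1.size(0), device=x1.device).view(-1, 1, 1, 1)
    x_t = (1 - t) * x0 + t * x1
    v_target = x1 - x0
    v_pred = decoder(x_t, t.flatten(), z)   # decoder conditioned on the token
    loss_fm = F.mse_loss(v_pred, v_target)

    # Cosine-similarity loss: keep the adapted token close in direction to the
    # SSL token so the latent space retains the original SSL geometry.
    loss_cos = 1.0 - F.cosine_similarity(z, z_ssl, dim=-1).mean()

    return loss_fm + lambda_cos * loss_cos
```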