

One Layer Is Enough: Adapting Pretrained Visual Encoders for Image Generation

December 8, 2025
作者: Yuan Gao, Chen Chen, Tianrong Chen, Jiatao Gu
cs.AI

Abstract

Visual generative models (e.g., diffusion models) typically operate in compressed latent spaces to balance training efficiency and sample quality. In parallel, there has been growing interest in leveraging high-quality pre-trained visual representations, either by aligning them inside VAEs or directly within the generative model. However, adapting such representations remains challenging due to fundamental mismatches between understanding-oriented features and generation-friendly latent spaces. Representation encoders benefit from high-dimensional latents that capture diverse hypotheses for masked regions, whereas generative models favor low-dimensional latents that must faithfully preserve injected noise. This discrepancy has led prior work to rely on complex objectives and architectures. In this work, we propose FAE (Feature Auto-Encoder), a simple yet effective framework that adapts pre-trained visual representations into low-dimensional latents suitable for generation using as little as a single attention layer, while retaining sufficient information for both reconstruction and understanding. The key is to couple two separate deep decoders: one trained to reconstruct the original feature space, and a second that takes the reconstructed features as input for image generation. FAE is generic; it can be instantiated with a variety of self-supervised encoders (e.g., DINO, SigLIP) and plugged into two distinct generative families: diffusion models and normalizing flows. Across class-conditional and text-to-image benchmarks, FAE achieves strong performance. For example, on ImageNet 256x256, our diffusion model with CFG attains a near state-of-the-art FID of 1.29 (800 epochs) and 1.70 (80 epochs). Without CFG, FAE reaches the state-of-the-art FID of 1.48 (800 epochs) and 2.08 (80 epochs), demonstrating both high quality and fast learning.
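As a rough illustration of the design described in the abstract, the PyTorch sketch below pairs a single attention-layer adapter, which compresses frozen encoder features (e.g., from DINO or SigLIP) into a low-dimensional latent, with a deeper decoder trained to reconstruct the original feature space. All names, dimensions, and layer counts (FeatureAutoEncoderSketch, feat_dim, latent_dim, the plain TransformerEncoderLayer blocks) are illustrative assumptions rather than the paper's exact architecture; the second decoder that generates pixels from the reconstructed features, and the diffusion or normalizing-flow model trained in the latent space, are omitted.

```python
# Minimal sketch of the FAE idea, under assumed sizes and module choices.
import torch
import torch.nn as nn


class FeatureAutoEncoderSketch(nn.Module):
    def __init__(self, feat_dim=768, latent_dim=16):
        super().__init__()
        # A single attention layer adapts understanding-oriented features
        # into a generation-friendly, low-dimensional latent per token.
        self.adapter = nn.TransformerEncoderLayer(
            d_model=feat_dim, nhead=8, batch_first=True
        )
        self.to_latent = nn.Linear(feat_dim, latent_dim)

        # A deeper decoder maps the low-dimensional latent back to the
        # original feature space (depth of 4 is an arbitrary illustration).
        self.feature_decoder = nn.Sequential(
            nn.Linear(latent_dim, feat_dim),
            *[nn.TransformerEncoderLayer(d_model=feat_dim, nhead=8, batch_first=True)
              for _ in range(4)],
        )

    def encode(self, features):
        # features: (B, num_tokens, feat_dim) from a frozen pretrained encoder
        return self.to_latent(self.adapter(features))

    def decode(self, latents):
        return self.feature_decoder(latents)

    def forward(self, features):
        latents = self.encode(features)
        recon = self.decode(latents)
        # Feature-reconstruction objective; a separate pixel decoder and the
        # generative model in latent space are not shown here.
        loss = nn.functional.mse_loss(recon, features)
        return latents, recon, loss


if __name__ == "__main__":
    model = FeatureAutoEncoderSketch()
    fake_features = torch.randn(2, 256, 768)  # stand-in for frozen encoder output
    latents, recon, loss = model(fake_features)
    print(latents.shape, recon.shape, loss.item())
```

In this framing, the diffusion or normalizing-flow model would then be trained on the compact latents produced by encode, while the deep feature decoder carries the burden of recovering the rich representation needed for reconstruction and understanding.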