Both Semantics and Reconstruction Matter: Making Representation Encoders Ready for Text-to-Image Generation and Editing
December 19, 2025
Authors: Shilong Zhang, He Zhang, Zhifei Zhang, Chongjian Ge, Shuchen Xue, Shaoteng Liu, Mengwei Ren, Soo Ye Kim, Yuqian Zhou, Qing Liu, Daniil Pakhomov, Kai Zhang, Zhe Lin, Ping Luo
cs.AI
Abstract
Modern Latent Diffusion Models (LDMs) typically operate in low-level Variational Autoencoder (VAE) latent spaces that are primarily optimized for pixel-level reconstruction. To unify vision generation and understanding, a burgeoning trend is to adopt high-dimensional features from representation encoders as generative latents. However, we empirically identify two fundamental obstacles in this paradigm: (1) the discriminative feature space lacks compact regularization, making diffusion models prone to off-manifold latents that lead to inaccurate object structures; and (2) the encoder's inherently weak pixel-level reconstruction hinders the generator from learning accurate fine-grained geometry and texture. In this paper, we propose a systematic framework to adapt understanding-oriented encoder features for generative tasks. We introduce a semantic-pixel reconstruction objective to regularize the latent space, enabling the compression of both semantic information and fine-grained details into a highly compact representation (96 channels with 16x16 spatial downsampling). This design ensures that the latent space remains semantically rich and achieves state-of-the-art image reconstruction, while remaining compact enough for accurate generation. Leveraging this representation, we design a unified Text-to-Image (T2I) and image editing model. Benchmarking against various feature spaces, we demonstrate that our approach achieves state-of-the-art reconstruction, faster convergence, and substantial performance gains in both T2I and editing tasks, validating that representation encoders can be effectively adapted into robust generative components.
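To make the abstract's central idea concrete, the following is a minimal, hypothetical PyTorch sketch of a joint semantic-pixel reconstruction objective (a pixel-level L1 term plus a cosine-similarity term against frozen representation-encoder features), together with a shape check for the compact latent described above (96 channels at 16×16 spatial downsampling). The function name, the specific loss terms, and the `semantic_weight` hyperparameter are illustrative assumptions, not the paper's actual implementation.

```python
import torch
import torch.nn.functional as F


def semantic_pixel_reconstruction_loss(
    pixel_recon: torch.Tensor,     # decoded image,                  (B, 3, H, W)
    pixel_target: torch.Tensor,    # original image,                 (B, 3, H, W)
    feat_recon: torch.Tensor,      # reconstructed encoder features, (B, N, D)
    feat_target: torch.Tensor,     # frozen encoder features,        (B, N, D)
    semantic_weight: float = 1.0,  # hypothetical balancing weight, not given in the abstract
) -> torch.Tensor:
    """Joint objective: pixel-level reconstruction plus semantic feature alignment.

    A minimal sketch of the 'semantic-pixel reconstruction' idea in the abstract;
    the actual losses, weights, and feature targets used by the paper may differ.
    """
    # Pixel branch: preserve fine-grained geometry and texture (plain L1 here).
    pixel_loss = F.l1_loss(pixel_recon, pixel_target)

    # Semantic branch: keep the latent aligned with the representation encoder
    # (a cosine-similarity term on the feature dimension here).
    sim = F.cosine_similarity(feat_recon, feat_target.detach(), dim=-1)
    semantic_loss = (1.0 - sim).mean()

    return pixel_loss + semantic_weight * semantic_loss


# Shape check for the compact latent described in the abstract: a 256x256 image
# with 16x16 spatial downsampling and 96 channels gives a (96, 16, 16) latent.
latent = torch.randn(1, 96, 256 // 16, 256 // 16)
assert latent.shape == (1, 96, 16, 16)
```

The key design point this sketch illustrates is the balance the abstract describes: the pixel term regularizes the otherwise unconstrained discriminative feature space toward accurate low-level reconstruction, while the semantic term keeps the compressed latent aligned with the understanding-oriented encoder.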