DecQ: Detail-Condensing Queries for Enhanced Reconstruction and Generation in Representation Autoencoders

Abstract

Representation Autoencoders (RAEs) use a frozen vision foundation model (VFM) as the tokenizer encoder, providing robust high-level representations that enable fast convergence and high-quality generation in latent diffusion models. However, freezing the VFM inherently constrains its spatial reconstruction capacity, which limits fine-grained generation and image editing. Conversely, introducing reconstruction-oriented signals through fine-tuning disrupts the pretrained semantic space and degrades generative fidelity. To address this trade-off, we propose DecQ, a simple yet effective framework for RAEs. Specifically, DecQ introduces lightweight detail-condensing queries that extract fine-grained information from intermediate VFM features through condenser modules. These queries are incorporated into the decoder to support reconstruction and are generated jointly with the patch tokens during generative modeling. By aggregating information from both shallow and deep layers, DecQ effectively mitigates the reconstruction-generation trade-off, improving both reconstruction quality and generative performance. Our experiments show that: (1) with only 8 additional queries and 3.9% extra computation, DecQ raises the reconstruction PSNR of the frozen DINOv2-based RAE from 19.13 dB to 22.76 dB; and (2) for generative modeling, DecQ converges 3.3× faster than RAE, attaining an FID of 1.41 without guidance and 1.05 with guidance.
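To put the reported PSNR gain in perspective, the short sketch below converts it into an implied change in mean squared error. It assumes the standard definition PSNR = 10 log10(MAX^2 / MSE) with the same peak value before and after; the function name `mse_ratio_from_psnr_gain` is hypothetical and introduced only for illustration. It is not part of DecQ.

```python
import math


def mse_ratio_from_psnr_gain(psnr_before_db: float, psnr_after_db: float) -> float:
    """Return MSE_after / MSE_before implied by a PSNR change.

    Assumes PSNR = 10 * log10(MAX^2 / MSE) with an identical peak value MAX
    in both measurements, so the peak cancels out of the ratio.
    """
    gain_db = psnr_after_db - psnr_before_db
    return 10.0 ** (-gain_db / 10.0)


# PSNR values quoted in the abstract (frozen DINOv2-based RAE vs. DecQ).
psnr_rae = 19.13
psnr_decq = 22.76

ratio = mse_ratio_from_psnr_gain(psnr_rae, psnr_decq)
print(f"PSNR gain: {psnr_decq - psnr_rae:.2f} dB")
print(f"Implied MSE ratio (after / before): {ratio:.3f}")
print(f"Implied MSE reduction: {(1.0 - ratio) * 100:.1f}%")
```

Under that assumption, the 3.63 dB gain corresponds to the reconstruction MSE falling to roughly 0.43 of its original value, that is, a reduction of a little over half.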