IDEAL: In-depth Alignment for Discrete Representation Autoencoders

Abstract

Built on pretrained vision foundation models (VFMs), representation autoencoders (RAEs) have recently emerged as a promising approach for constructing semantically rich latent spaces for image generation. However, their reconstruction quality often remains suboptimal, largely because deep VFM representations do not preserve sufficient fine-grained visual detail. This limitation becomes even more severe after discretization, where the missing low-level information is difficult to recover. We observe that shallow VFM features retain considerably richer local appearance and structural detail, which complements the high-level semantics carried by the deep features used in existing RAEs. Motivated by this complementarity, we propose Ideal, an In-depth Alignment framework for discrete representation autoencoding. By jointly aligning quantized tokens with both shallow and deep VFM features, Ideal enables the resulting discrete visual tokens to preserve both visual fidelity and rich semantics. Extensive experiments demonstrate that Ideal yields superior reconstruction performance, achieving 0.61 rFID on ImageNet and improving on the previous best method by 0.28. When used for autoregressive image generation, Ideal further achieves a gFID of 1.89, establishing a new state of the art in that setting.
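The abstract does not specify how the joint alignment is computed, so the following is only a minimal sketch of one way the idea could look. It assumes nearest-neighbour vector quantization, a cosine-distance alignment term, and shallow and deep VFM features that have already been projected to the token dimension. The names `quantize`, `joint_alignment_loss`, `w_shallow`, and `w_deep` are hypothetical and are not taken from the paper.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v + 1e-8)


def quantize(z, codebook):
    """Nearest-neighbour lookup (assumed quantizer): return (index, code)."""
    def sqdist(code):
        return sum((a - b) ** 2 for a, b in zip(z, code))

    idx = min(range(len(codebook)), key=lambda i: sqdist(codebook[i]))
    return idx, codebook[idx]


def joint_alignment_loss(tokens, shallow, deep, w_shallow=1.0, w_deep=1.0):
    """Weighted sum of mean (1 - cosine) distances to both feature sets.

    tokens, shallow, deep: lists of vectors, one per spatial position.
    The weights and the cosine form are illustrative assumptions.
    """
    n = len(tokens)
    loss_shallow = sum(1.0 - cosine(t, s) for t, s in zip(tokens, shallow)) / n
    loss_deep = sum(1.0 - cosine(t, d) for t, d in zip(tokens, deep)) / n
    return w_shallow * loss_shallow + w_deep * loss_deep


# Toy usage: quantize two latents, then score them against both targets.
codebook = [[1.0, 0.0], [0.0, 1.0]]
latents = [[0.9, 0.1], [0.2, 0.8]]
tokens = [quantize(z, codebook)[1] for z in latents]
shallow_feats = [[1.0, 0.0], [0.0, 1.0]]  # stand-in for shallow VFM features
deep_feats = [[1.0, 0.0], [1.0, 0.0]]     # stand-in for deep VFM features
loss = joint_alignment_loss(tokens, shallow_feats, deep_feats)
```

In this toy case the tokens match the shallow targets exactly, while the second token is orthogonal to its deep target, so only the deep term contributes to the loss. In practice, separate learned projection heads per target would likely be needed, since a single token cannot equal two different feature vectors at once.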