DINO-SAE: DINO Spherical Autoencoder for High-Fidelity Image Reconstruction and Generation
January 30, 2026
Authors: Hun Chang, Byunghee Cha, Jong Chul Ye
cs.AI
Abstract
Recent studies have explored using pretrained Vision Foundation Models (VFMs) such as DINO for generative autoencoders, showing strong generative performance. Unfortunately, existing approaches often suffer from limited reconstruction fidelity due to the loss of high-frequency details. In this work, we present the DINO Spherical Autoencoder (DINO-SAE), a framework that bridges semantic representation and pixel-level reconstruction. Our key insight is that semantic information in contrastive representations is primarily encoded in the direction of feature vectors, while forcing strict magnitude matching can hinder the encoder from preserving fine-grained details. To address this, we introduce a Hierarchical Convolutional Patch Embedding module that enhances local structure and texture preservation, and a Cosine Similarity Alignment objective that enforces semantic consistency while allowing flexible feature magnitudes for detail retention. Furthermore, leveraging the observation that SSL-based foundation model representations intrinsically lie on a hypersphere, we employ Riemannian Flow Matching to train a Diffusion Transformer (DiT) directly on this spherical latent manifold. Experiments on ImageNet-1K demonstrate that our approach achieves state-of-the-art reconstruction quality, reaching 0.37 rFID and 26.2 dB PSNR, while maintaining strong semantic alignment to the pretrained VFM. Notably, our Riemannian Flow Matching-based DiT exhibits efficient convergence, achieving a gFID of 3.47 at 80 epochs.
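The abstract describes two core ideas: aligning encoder features to the VFM by direction only (cosine similarity), and training the generative model on the unit hypersphere, where flow-matching interpolation follows geodesics rather than straight lines. A minimal PyTorch sketch of both is below; the function names (`cosine_alignment_loss`, `slerp`) and the exact loss form are illustrative assumptions, not the paper's released implementation.

```python
import torch
import torch.nn.functional as F

def cosine_alignment_loss(z_enc: torch.Tensor, z_vfm: torch.Tensor) -> torch.Tensor:
    # Penalize only the angle between encoder features and the frozen
    # VFM features: semantic content lives in the direction of the
    # vectors, while magnitudes stay free to encode fine detail.
    # (Hypothetical loss form, consistent with the abstract's claim.)
    return 1.0 - F.cosine_similarity(z_enc, z_vfm, dim=-1).mean()

def slerp(x0: torch.Tensor, x1: torch.Tensor, t: float, eps: float = 1e-7) -> torch.Tensor:
    # Geodesic (great-circle) interpolation on the unit hypersphere
    # between a noise sample x0 and a data latent x1 -- the spherical
    # analogue of the straight-line interpolant in Euclidean flow matching.
    x0 = F.normalize(x0, dim=-1)
    x1 = F.normalize(x1, dim=-1)
    cos = (x0 * x1).sum(dim=-1, keepdim=True).clamp(-1 + eps, 1 - eps)
    theta = torch.acos(cos)          # angle along the geodesic
    s = torch.sin(theta)
    return (torch.sin((1.0 - t) * theta) / s) * x0 + (torch.sin(t * theta) / s) * x1
```

Because `slerp` stays on the sphere for every `t`, a DiT trained with Riemannian flow matching only ever sees unit-norm inputs, matching the geometry of the SSL latent space.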