DINO-SAE: DINO Spherical Autoencoder for High-Fidelity Image Reconstruction and Generation

Abstract

Recent studies have explored using pretrained Vision Foundation Models (VFMs) such as DINO as the basis for generative autoencoders, showing strong generative performance. However, existing approaches often suffer from limited reconstruction fidelity because high-frequency details are lost. In this work, we present the DINO Spherical Autoencoder (DINO-SAE), a framework that bridges the gap between semantic representation and pixel-level reconstruction. Our key insight is that semantic information in contrastive representations is encoded primarily in the direction of feature vectors, whereas forcing strict magnitude matching can prevent the encoder from preserving fine-grained details. To address this, we introduce a Hierarchical Convolutional Patch Embedding module that improves the preservation of local structure and texture, and a Cosine Similarity Alignment objective that enforces semantic consistency while leaving feature magnitudes free to retain detail. Furthermore, building on the observation that representations from SSL-based foundation models intrinsically lie on a hypersphere, we use Riemannian Flow Matching to train a Diffusion Transformer (DiT) directly on this spherical latent manifold. Experiments on ImageNet-1K show that our approach achieves state-of-the-art reconstruction quality, reaching 0.37 rFID and 26.2 dB PSNR, while maintaining strong semantic alignment with the pretrained VFM. Notably, our Riemannian Flow Matching-based DiT converges efficiently, achieving a gFID of 3.47 at 80 epochs.
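The two geometric ideas in the abstract can be illustrated with a small sketch. It is not the authors' implementation: the function names (`cosine_alignment_loss`, `slerp`, `geodesic_velocity`) are hypothetical, the vectors are toy lists rather than DINO features, and the geodesic construction is one common way to build a flow-matching target on a sphere, assumed here because the abstract does not spell out the exact formulation. The first function shows a loss that depends only on feature direction; the other two show a great-circle path between two unit vectors and its time derivative, which is the kind of tangent velocity a flow-matching model could regress.

```python
import math


def _dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def normalize(v, eps=1e-12):
    """Project a vector onto the unit hypersphere."""
    n = max(math.sqrt(_dot(v, v)), eps)
    return [x / n for x in v]


def cosine_alignment_loss(student, teacher):
    """1 - cos(student, teacher): sensitive to direction, blind to magnitude."""
    return 1.0 - _dot(normalize(student), normalize(teacher))


def _angle(x0, x1):
    # Clamp to guard against rounding just outside [-1, 1].
    return math.acos(max(-1.0, min(1.0, _dot(x0, x1))))


def slerp(x0, x1, t):
    """Point at time t on the great circle from x0 to x1.

    Assumes x0 and x1 are unit vectors and not antipodal.
    """
    theta = _angle(x0, x1)
    if theta < 1e-8:  # endpoints coincide: the path is constant
        return list(x0)
    s = math.sin(theta)
    w0 = math.sin((1.0 - t) * theta) / s
    w1 = math.sin(t * theta) / s
    return [w0 * a + w1 * b for a, b in zip(x0, x1)]


def geodesic_velocity(x0, x1, t):
    """Time derivative of slerp at t (tangent to the sphere at slerp(x0, x1, t))."""
    theta = _angle(x0, x1)
    if theta < 1e-8:
        return [0.0] * len(x0)
    s = math.sin(theta)
    w0 = -theta * math.cos((1.0 - t) * theta) / s
    w1 = theta * math.cos(t * theta) / s
    return [w0 * a + w1 * b for a, b in zip(x0, x1)]


if __name__ == "__main__":
    # Scaling a feature vector leaves the alignment loss unchanged.
    print(cosine_alignment_loss([1.0, 2.0, 3.0], [2.0, 4.0, 6.0]))
    # Midpoint of the geodesic between two orthogonal unit vectors.
    print(slerp([1.0, 0.0], [0.0, 1.0], 0.5))
```

Because the loss normalizes both inputs, rescaling either vector does not change it, which mirrors the abstract's point that magnitudes can stay flexible while directions are aligned. The velocity returned by `geodesic_velocity` is orthogonal to the current point, so a step along it stays tangent to the sphere.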
