L'Embedding Normale Universale

Abstract

I modelli generativi e gli encoder visivi hanno progredito in gran parte su binari separati, ottimizzati per obiettivi diversi e basati su principi matematici differenti. Tuttavia, condividono una proprietà fondamentale: la Gaussianità dello spazio latente. I modelli generativi mappano rumore Gaussiano in immagini, mentre gli encoder mappano immagini in embedding semantici le cui coordinate si comportano empiricamente come Gaussiani. Ipotesizziamo che entrambi siano visioni di una sorgente latente condivisa, l'Universal Normal Embedding (UNE): uno spazio latente approssimativamente Gaussiano da cui gli embedding degli encoder e il rumore invertito tramite DDIM emergono come proiezioni lineari rumorose. Per testare la nostra ipotesi, introduciamo NoiseZoo, un dataset di latenti per immagine comprendente il rumore di diffusione invertito con DDIM e le corrispondenti rappresentazioni dell'encoder (CLIP, DINO). Su CelebA, probe lineari in entrambi gli spazi producono previsioni forti e allineate degli attributi, indicando che il rumore generativo codifica semantiche significative lungo direzioni lineari. Queste direzioni abilitano inoltre modifiche controllate e fedeli (ad esempio, sorriso, genere, età) senza modifiche architetturali, dove una semplice ortogonalizzazione mitiga gli intrecci spurii. Nel complesso, i nostri risultati forniscono supporto empirico all'ipotesi UNE e rivelano una geometria latente condivisa di tipo Gaussiano che collega concretamente codifica e generazione. Codice e dati sono disponibili su https://rbetser.github.io/UNE/

English

Generative models and vision encoders have largely advanced on separate tracks, optimized for different goals and grounded in different mathematical principles. Yet, they share a fundamental property: latent space Gaussianity. Generative models map Gaussian noise to images, while encoders map images to semantic embeddings whose coordinates empirically behave as Gaussian. We hypothesize that both are views of a shared latent source, the Universal Normal Embedding (UNE): an approximately Gaussian latent space from which encoder embeddings and DDIM-inverted noise arise as noisy linear projections. To test our hypothesis, we introduce NoiseZoo, a dataset of per-image latents comprising DDIM-inverted diffusion noise and matching encoder representations (CLIP, DINO). On CelebA, linear probes in both spaces yield strong, aligned attribute predictions, indicating that generative noise encodes meaningful semantics along linear directions. These directions further enable faithful, controllable edits (e.g., smile, gender, age) without architectural changes, where simple orthogonalization mitigates spurious entanglements. Taken together, our results provide empirical support for the UNE hypothesis and reveal a shared Gaussian-like latent geometry that concretely links encoding and generation. Code and data are available https://rbetser.github.io/UNE/

L'Embedding Normale Universale

The Universal Normal Embedding

Abstract

Support