

The Universal Normal Embedding

March 23, 2026
Authors: Chen Tasker, Roy Betser, Eyal Gofer, Meir Yossef Levi, Guy Gilboa
cs.AI

Abstract

Generative models and vision encoders have largely advanced on separate tracks, optimized for different goals and grounded in different mathematical principles. Yet they share a fundamental property: latent space Gaussianity. Generative models map Gaussian noise to images, while encoders map images to semantic embeddings whose coordinates empirically behave as Gaussian. We hypothesize that both are views of a shared latent source, the Universal Normal Embedding (UNE): an approximately Gaussian latent space from which encoder embeddings and DDIM-inverted noise arise as noisy linear projections. To test our hypothesis, we introduce NoiseZoo, a dataset of per-image latents comprising DDIM-inverted diffusion noise and matching encoder representations (CLIP, DINO). On CelebA, linear probes in both spaces yield strong, aligned attribute predictions, indicating that generative noise encodes meaningful semantics along linear directions. These directions further enable faithful, controllable edits (e.g., smile, gender, age) without architectural changes, where simple orthogonalization mitigates spurious entanglements. Taken together, our results provide empirical support for the UNE hypothesis and reveal a shared Gaussian-like latent geometry that concretely links encoding and generation. Code and data are available at https://rbetser.github.io/UNE/
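The linear-direction machinery the abstract describes (probing latents for an attribute, then orthogonalizing the edit direction against a confound) can be sketched as follows. This is a minimal illustration with random stand-in data, not the paper's pipeline: the latent matrix, the attribute labels, and the mean-difference direction are all assumptions standing in for NoiseZoo latents and a trained linear probe's weight vector.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-ins for per-image latents (e.g., flattened DDIM-inverted noise
# or CLIP/DINO embeddings); the real latents would come from NoiseZoo.
n, d = 200, 64
latents = rng.standard_normal((n, d))
# Hypothetical binary attribute labels (e.g., "smiling" from CelebA).
smile_labels = rng.integers(0, 2, n)
gender_labels = rng.integers(0, 2, n)

def attribute_direction(x, y):
    """Unit direction separating the two classes: difference of class means.
    A closed-form stand-in for a trained linear probe's weight vector."""
    v = x[y == 1].mean(axis=0) - x[y == 0].mean(axis=0)
    return v / np.linalg.norm(v)

def orthogonalize(v, confound):
    """Remove the component of v along a unit confound direction, so that
    editing one attribute does not drift along the other (the 'simple
    orthogonalization' the abstract refers to)."""
    v = v - (v @ confound) * confound
    return v / np.linalg.norm(v)

smile_dir = attribute_direction(latents, smile_labels)
gender_dir = attribute_direction(latents, gender_labels)
smile_only = orthogonalize(smile_dir, gender_dir)

# The cleaned direction has no component along the confound.
assert abs(smile_only @ gender_dir) < 1e-8

# Editing sketch: shift a latent along the cleaned direction; in the real
# pipeline the shifted noise would then be run through DDIM sampling.
alpha = 2.0
edited_latent = latents[0] + alpha * smile_only
```

With random labels the directions carry no real semantics; the point is the geometry: edits move along one unit direction while staying orthogonal to the confound.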