
The Universal Normal Embedding

March 23, 2026
Authors: Chen Tasker, Roy Betser, Eyal Gofer, Meir Yossef Levi, Guy Gilboa
cs.AI

Abstract

Generative models and vision encoders have largely advanced on separate tracks, optimized for different goals and grounded in different mathematical principles. Yet they share a fundamental property: latent space Gaussianity. Generative models map Gaussian noise to images, while encoders map images to semantic embeddings whose coordinates empirically behave as Gaussian. We hypothesize that both are views of a shared latent source, the Universal Normal Embedding (UNE): an approximately Gaussian latent space from which encoder embeddings and DDIM-inverted noise arise as noisy linear projections. To test our hypothesis, we introduce NoiseZoo, a dataset of per-image latents comprising DDIM-inverted diffusion noise and matching encoder representations (CLIP, DINO). On CelebA, linear probes in both spaces yield strong, aligned attribute predictions, indicating that generative noise encodes meaningful semantics along linear directions. These directions further enable faithful, controllable edits (e.g., smile, gender, age) without architectural changes, where simple orthogonalization mitigates spurious entanglements. Taken together, our results provide empirical support for the UNE hypothesis and reveal a shared Gaussian-like latent geometry that concretely links encoding and generation. Code and data are available at https://rbetser.github.io/UNE/
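The orthogonalization step mentioned above can be illustrated with a minimal sketch. The abstract does not specify the exact procedure, so the following assumes the common approach: given a linear edit direction (e.g., from a linear probe for "smile") in the latent space, project out the components along nuisance directions (e.g., "gender") via Gram-Schmidt so that applying the edit moves the latent only along the intended attribute. All names and dimensions here are hypothetical, for illustration only.

```python
import numpy as np

def orthogonalize(direction, nuisances):
    """Remove nuisance components from an edit direction via
    sequential Gram-Schmidt projection, then re-normalize.
    (For several non-orthogonal nuisances, orthonormalize them first.)"""
    d = np.asarray(direction, dtype=float).copy()
    for n in nuisances:
        n = np.asarray(n, dtype=float)
        n = n / np.linalg.norm(n)
        d -= np.dot(d, n) * n  # project out the nuisance component
    return d / np.linalg.norm(d)

# Hypothetical example: decorrelate a "smile" direction from "gender".
rng = np.random.default_rng(0)
smile = rng.normal(size=512)   # stand-in for a probe-derived direction
gender = rng.normal(size=512)  # stand-in for an entangled nuisance direction
smile_clean = orthogonalize(smile, [gender])
```

After this step, `smile_clean` has zero dot product with the (normalized) gender direction, so stepping a latent along it no longer drifts the entangled attribute; the cleaned direction is then added to the DDIM-inverted noise before re-generation.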