The Universal Normal Embedding

Abstract

Generative models and vision encoders have largely advanced on separate tracks, optimized for different goals and grounded in different mathematical principles. Yet they share a fundamental property: latent-space Gaussianity. Generative models map Gaussian noise to images, while encoders map images to semantic embeddings whose coordinates empirically behave as Gaussian. We hypothesize that both are views of a shared latent source, the Universal Normal Embedding (UNE): an approximately Gaussian latent space from which encoder embeddings and DDIM-inverted noise arise as noisy linear projections. To test our hypothesis, we introduce NoiseZoo, a dataset of per-image latents comprising DDIM-inverted diffusion noise and matching encoder representations (CLIP, DINO). On CelebA, linear probes in both spaces yield strong, aligned attribute predictions, indicating that generative noise encodes meaningful semantics along linear directions. These directions further enable faithful, controllable edits (e.g., smile, gender, age) without architectural changes, and simple orthogonalization mitigates spurious entanglements between attributes. Taken together, our results provide empirical support for the UNE hypothesis and reveal a shared Gaussian-like latent geometry that concretely links encoding and generation. Code and data are available at https://rbetser.github.io/UNE/.
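To make the ideas of "noisy linear projections", linear probes, and orthogonalization concrete, the following is a minimal toy sketch on synthetic data. It does not use NoiseZoo, CelebA, or any real diffusion model or encoder, and it is not the authors' implementation. The dimensions, the noise level, the ridge-regularized least-squares probe, the single Gram-Schmidt step, and all names (`fit_linear_probe`, `orthogonalize`, `noise_view`, `enc_view`) are illustrative assumptions.

```python
import numpy as np


def fit_linear_probe(latents, labels, ridge=1e-3):
    """Ridge-regularized least-squares probe (an assumed, simple probe choice).

    Returns w such that (latents - mean) @ w approximates the centered labels.
    """
    x = latents - latents.mean(axis=0)
    y = labels - labels.mean()
    dim = x.shape[1]
    return np.linalg.solve(x.T @ x + ridge * np.eye(dim), x.T @ y)


def orthogonalize(direction, nuisance):
    """Remove the component of `direction` along `nuisance` (one Gram-Schmidt step)."""
    unit = nuisance / np.linalg.norm(nuisance)
    return direction - (direction @ unit) * unit


rng = np.random.default_rng(0)
n, d_shared, d_noise, d_enc = 2000, 16, 64, 32

# Hypothetical shared Gaussian latent, standing in for the UNE.
z = rng.standard_normal((n, d_shared))

# Two noisy linear views of z: one stands in for DDIM-inverted noise,
# the other for encoder embeddings. Both maps are random toy matrices.
proj_noise = rng.standard_normal((d_shared, d_noise))
proj_enc = rng.standard_normal((d_shared, d_enc))
noise_view = z @ proj_noise + 0.1 * rng.standard_normal((n, d_noise))
enc_view = z @ proj_enc + 0.1 * rng.standard_normal((n, d_enc))

# Two correlated toy attributes defined on the shared latent.
smile = (z[:, 0] > 0).astype(float)
age = (0.6 * z[:, 0] + 0.8 * z[:, 1] > 0).astype(float)

# Probe the same attribute in both views and compare the per-sample scores.
w_smile_noise = fit_linear_probe(noise_view, smile)
w_smile_enc = fit_linear_probe(enc_view, smile)
scores_noise = (noise_view - noise_view.mean(axis=0)) @ w_smile_noise
scores_enc = (enc_view - enc_view.mean(axis=0)) @ w_smile_enc
score_corr = np.corrcoef(scores_noise, scores_enc)[0, 1]
probe_acc = np.mean((scores_noise > 0) == (smile > 0.5))

# Build an edit direction for "smile" that leaves the "age" probe score unchanged.
w_age_noise = fit_linear_probe(noise_view, age)
edit_dir = orthogonalize(w_smile_noise, w_age_noise)

print(f"probe accuracy in the noise view: {probe_acc:.3f}")
print(f"score correlation across views:   {score_corr:.3f}")
print(f"edit_dir . w_age (should be ~0):  {edit_dir @ w_age_noise:.2e}")
```

In this toy setting, shifting a latent along `edit_dir` changes the smile probe score while leaving the age probe score fixed, because `edit_dir` is orthogonal to `w_age_noise` by construction. Whether the same holds for real DDIM-inverted noise is an empirical question that this sketch does not answer.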
