Adapting Latent Geometry for Spherical Flow Matching in Image Generation

Abstract

Latent flow matching for image generation usually transports Gaussian noise to variational autoencoder latents along linear paths. Both endpoints, however, concentrate in thin spherical shells, and a Euclidean chord leaves those shells even when preprocessing aligns their radii. By decomposing each latent token into radial and angular components, we show through component-swap probes that decoded perceptual and semantic content is carried predominantly by direction, with radius contributing much less. We therefore project data latents onto a fixed token radius, use the radial projection of Gaussian noise as the spherical prior, finetune the decoder with the encoder frozen, and replace linear interpolation with spherical linear interpolation. The resulting geodesic paths stay on the sphere at every timestep, and their velocity targets are purely angular by construction. Under matched training, the method consistently improves class-conditional ImageNet-256 FID across different image tokenizers, leaves the diffusion architecture unchanged, and requires no auxiliary encoder or representation-alignment objective.
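The geometric claims above can be checked numerically. The following is a minimal sketch, not the authors' implementation: it projects a Gaussian noise token and a stand-in data token onto a common radius, interpolates between them with spherical linear interpolation, and compares the result with the Euclidean chord. The helper names (`norm`, `project_to_sphere`, `slerp`), the token dimension, and the choice of radius `sqrt(dim)` are assumptions made for illustration; the abstract does not specify them.

```python
import math
import random


def norm(v):
    """Euclidean norm of a vector given as a list of floats."""
    return math.sqrt(sum(x * x for x in v))


def project_to_sphere(v, radius):
    """Rescale a token to the given radius while keeping its direction."""
    n = norm(v)
    return [radius * x / n for x in v]


def slerp(x0, x1, t):
    """Spherical linear interpolation between two tokens of equal norm.

    Returns (x_t, v_t): the point on the geodesic at time t and the
    time derivative d x_t / dt, which would serve as the velocity target.
    """
    cos_omega = sum(a * b for a, b in zip(x0, x1)) / (norm(x0) * norm(x1))
    omega = math.acos(max(-1.0, min(1.0, cos_omega)))  # angle between tokens
    s = math.sin(omega)
    if s < 1e-8:
        # Nearly parallel endpoints: fall back to linear interpolation.
        # (Antipodal endpoints, where the geodesic is not unique, are not
        # handled in this sketch.)
        x_t = [(1 - t) * a + t * b for a, b in zip(x0, x1)]
        v_t = [b - a for a, b in zip(x0, x1)]
        return x_t, v_t
    c0 = math.sin((1 - t) * omega) / s
    c1 = math.sin(t * omega) / s
    d0 = -omega * math.cos((1 - t) * omega) / s
    d1 = omega * math.cos(t * omega) / s
    x_t = [c0 * a + c1 * b for a, b in zip(x0, x1)]
    v_t = [d0 * a + d1 * b for a, b in zip(x0, x1)]
    return x_t, v_t


# Illustrative setup (assumed values, not taken from the paper).
random.seed(0)
dim = 16
radius = math.sqrt(dim)

noise = project_to_sphere([random.gauss(0, 1) for _ in range(dim)], radius)
data = project_to_sphere([random.gauss(0, 1) for _ in range(dim)], radius)

for t in (0.0, 0.25, 0.5, 0.75, 1.0):
    x_t, v_t = slerp(noise, data, t)
    chord = [(1 - t) * a + t * b for a, b in zip(noise, data)]
    radial_speed = sum(a * b for a, b in zip(x_t, v_t)) / norm(x_t)
    print(
        f"t={t:.2f}  |slerp|={norm(x_t):.4f}  "
        f"|chord|={norm(chord):.4f}  radial speed={radial_speed:.2e}"
    )
```

In this sketch the interpolated norm stays at the chosen radius for every `t`, the chord's norm dips below it away from the endpoints, and the radial component of the velocity is zero up to floating-point error, which is what "purely angular by construction" means here.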