Latent Geometry Alignment for Spherical Flow Matching in Image Generation

Abstract

Latent flow matching for image generation usually transports Gaussian noise to variational autoencoder latents along linear paths. Both endpoints, however, concentrate in thin spherical shells, and a Euclidean chord leaves those shells even when preprocessing aligns their radii. By decomposing each latent token into radial and angular components, we show through component-swap probes that decoded perceptual and semantic content is carried predominantly by direction, with radius contributing much less. We therefore project data latents onto a fixed token radius, use the radial projection of Gaussian noise as the spherical prior, finetune the decoder with the encoder frozen, and replace linear interpolation with spherical linear interpolation. The resulting geodesic paths stay on the sphere at every timestep, and their velocity targets are purely angular by construction. Under matched training, the method consistently improves class-conditional ImageNet-256 FID across different image tokenizers, leaves the diffusion architecture unchanged, and requires no auxiliary encoder or representation-alignment objective.
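The geometric claims above can be checked numerically on a single token. The following is a minimal sketch, not the authors' implementation: the helper names, the token dimension of 16, the radius of sqrt(dim), and the convention that t = 0 is noise and t = 1 is data are all assumptions made for illustration, and a second Gaussian draw stands in for an encoder latent. It projects both endpoints onto the same sphere, then compares the Euclidean chord with spherical linear interpolation and evaluates the slerp velocity.

```python
import math
import random


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def norm(v):
    return math.sqrt(dot(v, v))


def project_to_radius(v, radius):
    """Rescale v to the given norm while keeping its direction."""
    n = norm(v)
    return [radius * x / n for x in v]


def lerp(x0, x1, t):
    """Euclidean chord between x0 and x1."""
    return [(1.0 - t) * a + t * b for a, b in zip(x0, x1)]


def angle_between(x0, x1):
    """Angle between two vectors, with the cosine clamped for safety."""
    c = dot(x0, x1) / (norm(x0) * norm(x1))
    return math.acos(max(-1.0, min(1.0, c)))


def slerp(x0, x1, t):
    """Spherical linear interpolation between two vectors of equal norm."""
    omega = angle_between(x0, x1)
    s = math.sin(omega)
    if s < 1e-8:
        # Degenerate case (parallel or antipodal endpoints): fall back to
        # the chord. The antipodal geodesic is not unique, so this is only
        # a placeholder.
        return lerp(x0, x1, t)
    w0 = math.sin((1.0 - t) * omega) / s
    w1 = math.sin(t * omega) / s
    return [w0 * a + w1 * b for a, b in zip(x0, x1)]


def slerp_velocity(x0, x1, t):
    """Time derivative of slerp(x0, x1, t) with respect to t."""
    omega = angle_between(x0, x1)
    s = math.sin(omega)
    if s < 1e-8:
        return [b - a for a, b in zip(x0, x1)]
    w0 = -omega * math.cos((1.0 - t) * omega) / s
    w1 = omega * math.cos(t * omega) / s
    return [w0 * a + w1 * b for a, b in zip(x0, x1)]


if __name__ == "__main__":
    random.seed(0)
    dim = 16                 # hypothetical token dimension
    radius = math.sqrt(dim)  # assumed fixed token radius

    # Spherical prior: Gaussian noise projected radially onto the sphere.
    noise = project_to_radius([random.gauss(0, 1) for _ in range(dim)], radius)
    # Stand-in for a data latent, projected onto the same radius.
    latent = project_to_radius([random.gauss(0, 1) for _ in range(dim)], radius)

    t = 0.5
    x_lin = lerp(noise, latent, t)
    x_sph = slerp(noise, latent, t)
    v = slerp_velocity(noise, latent, t)

    print("radius:", radius)
    print("chord norm at t=0.5:", norm(x_lin))
    print("slerp norm at t=0.5:", norm(x_sph))
    print("radial part of slerp velocity:", dot(v, x_sph) / radius)
```

For two non-parallel endpoints on the same sphere, the chord midpoint has norm radius * cos(omega / 2), which is strictly below the radius, while the slerp point keeps the radius exactly. The slerp velocity is orthogonal to the current position, which is the sense in which the velocity target is purely angular.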