Latent Geometry Alignment for Spherical Flow Matching in Image Generation

Abstract

Latent flow matching for image generation usually transports Gaussian noise to variational autoencoder latents along linear paths. Both endpoints, however, concentrate in thin spherical shells, and a Euclidean chord leaves those shells even when preprocessing aligns their radii. By decomposing each latent token into radial and angular components, we show through component-swap probes that decoded perceptual and semantic content is carried predominantly by direction, with radius contributing much less. We therefore project data latents onto a fixed token radius, use the radial projection of Gaussian noise as the spherical prior, fine-tune the decoder with the encoder frozen, and replace linear interpolation with spherical linear interpolation. The resulting geodesic paths stay on the sphere at every timestep, and their velocity targets are purely angular by construction. Under matched training conditions, the method consistently improves class-conditional ImageNet-256 FID across different image tokenizers, leaves the diffusion architecture unchanged, and requires no auxiliary encoder or representation-alignment objective.
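
The geometric claims above can be checked numerically. The following is a minimal NumPy sketch, not the authors' implementation. The function names, the token radius of sqrt(dim), the convention that noise sits at t = 0 and data at t = 1, and the random stand-in "latents" are all assumptions made for illustration. It projects tokens onto a fixed radius, interpolates with spherical linear interpolation, and returns the time derivative of the path as the velocity target.

```python
import numpy as np

def project_to_sphere(x, radius, eps=1e-8):
    """Rescale each token (last axis) to a fixed norm, keeping its direction."""
    norm = np.linalg.norm(x, axis=-1, keepdims=True)
    return radius * x / np.maximum(norm, eps)

def slerp_path(x0, x1, t, eps=1e-6):
    """Point and velocity at time t on the geodesic from x0 (t=0) to x1 (t=1).

    Both inputs are assumed to lie on the same sphere (equal per-token norm).
    Nearly parallel or antiparallel pairs are degenerate; the eps clamp only
    avoids division by zero and does not handle them exactly.
    """
    radius = np.linalg.norm(x0, axis=-1, keepdims=True)
    cos = np.sum(x0 * x1, axis=-1, keepdims=True) / radius**2
    theta = np.arccos(np.clip(cos, -1.0, 1.0))   # angle between the endpoints
    sin_theta = np.maximum(np.sin(theta), eps)
    a = np.sin((1.0 - t) * theta) / sin_theta
    b = np.sin(t * theta) / sin_theta
    x_t = a * x0 + b * x1
    # d/dt of the slerp weights gives the velocity target.
    da = -theta * np.cos((1.0 - t) * theta) / sin_theta
    db = theta * np.cos(t * theta) / sin_theta
    v_t = da * x0 + db * x1
    return x_t, v_t

rng = np.random.default_rng(0)
tokens, dim = 16, 32
radius = np.sqrt(dim)  # assumed radius; the source only says "fixed"

# Random stand-ins for VAE latents and Gaussian noise, one row per token.
data = project_to_sphere(rng.normal(size=(tokens, dim)) * 3.0 + 0.5, radius)
noise = project_to_sphere(rng.normal(size=(tokens, dim)), radius)

x_t, v_t = slerp_path(noise, data, t=0.3)
chord_mid = 0.5 * noise + 0.5 * data  # linear-path midpoint for comparison

print(np.linalg.norm(x_t, axis=-1).mean())        # stays at the sphere radius
print(np.abs(np.sum(x_t * v_t, axis=-1)).max())   # radial part of the velocity
print(np.linalg.norm(chord_mid, axis=-1).mean())  # chord falls inside the sphere
```

Because the norm of `x_t` is constant in t, the velocity is orthogonal to `x_t` at every timestep, which is the sense in which the target is purely angular. The midpoint of the linear chord has a smaller norm than either endpoint, even though both endpoints share the same radius.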