JLT: Clean-Latent Prediction in Latent Diffusion Transformers

Abstract

Work on flow matching with clean-data prediction has shown that regressing the clean point can exploit low-dimensional structure more effectively than predicting an ambient noised quantity. We ask whether this principle remains useful after images are mapped into a learned latent space, where compression has already removed much of the raw pixel variability. We introduce JLT, a 130M-parameter latent diffusion Transformer over frozen FLUX.2 VAE codes, and compare clean-latent prediction against a matched velocity-prediction DiT under the same representation, backbone, and training settings. Although the three variables x, epsilon, and v are linearly convertible at any fixed corruption time, a local Gaussian analysis shows that velocity regression inherits an isotropic target-covariance floor and amplifies low-variance latent directions, whereas clean prediction damps them. On ImageNet 256×256, JLT-B/1 obtains an FID-50K of 2.50 with classifier-free guidance, a large gap over the matched velocity-prediction baseline. These results suggest that prediction targets in latent diffusion are representation-dependent geometric choices rather than interchangeable algebraic parameterizations.
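The linear convertibility of x, epsilon, and v, and the contrast between damping and amplification, can be illustrated with a small sketch. It rests on assumptions not stated in the abstract: the interpolation z_t = t·x + (1 − t)·epsilon with velocity v = x − epsilon, and a one-dimensional Gaussian toy model per latent direction (x with variance lam, unit-variance noise). All function names are hypothetical, and this is an illustration of the idea rather than the paper's actual analysis.

```python
import numpy as np

def make_noisy(x, eps, t):
    # Assumed interpolation (not given in the abstract): z_t = t*x + (1-t)*eps.
    return t * x + (1.0 - t) * eps

def v_from_x(x_pred, z_t, t):
    # With v = x - eps and eps = (z_t - t*x)/(1-t), we get v = (x - z_t)/(1-t).
    return (x_pred - z_t) / (1.0 - t)

def x_from_v(v_pred, z_t, t):
    # Inverse of v_from_x: x = z_t + (1-t)*v.
    return z_t + (1.0 - t) * v_pred

def eps_from_v(v_pred, z_t, t):
    # eps = z_t - t*v under the same assumed convention.
    return z_t - t * v_pred

def optimal_gains(lam, t):
    # Toy model for one latent direction: x ~ N(0, lam), eps ~ N(0, 1).
    # Returns the least-squares linear gains g such that target ~ g * z_t.
    var_z = t**2 * lam + (1.0 - t) ** 2
    gain_x = t * lam / var_z                  # Cov(x, z_t) / Var(z_t)
    gain_v = (t * lam - (1.0 - t)) / var_z    # Cov(v, z_t) / Var(z_t)
    return gain_x, gain_v

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    x, eps, t = rng.normal(size=8), rng.normal(size=8), 0.3
    z_t = make_noisy(x, eps, t)
    v = x - eps
    print(np.allclose(x_from_v(v, z_t, t), x))    # round trip v -> x
    print(np.allclose(v_from_x(x, z_t, t), v))    # round trip x -> v
    for lam in (1.0, 1e-2, 1e-4):
        print(lam, optimal_gains(lam, t=0.5))
```

In this toy model the velocity target has variance lam + 1, so it never falls below the unit noise variance in any direction, and as lam approaches 0 the clean-prediction gain tends to 0 while the velocity gain tends to −1/(1 − t). The three targets are therefore algebraically interchangeable, but the regression problems they pose differ per direction.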