JLT: Clean-Latent Prediction in Latent Diffusion Transformers

Abstract

Flow matching with clean-data prediction has shown that regressing the clean point can exploit low-dimensional structure more effectively than predicting an ambient noised quantity. We ask whether this principle remains useful after images are mapped into a learned latent space, where compression has already removed much of the raw pixel variability. We introduce JLT, a 130M-parameter latent diffusion Transformer operating on frozen FLUX.2 VAE codes, and compare clean-latent prediction against a matched velocity-prediction DiT under the same representation, backbone, and training settings. Although the three variables x, ε, and v are linearly interconvertible at any fixed corruption time, a local Gaussian analysis shows that velocity regression inherits an isotropic target-covariance floor and amplifies low-variance latent directions, whereas clean prediction damps them. On ImageNet 256×256, JLT-B/1 obtains an FID-50K of 2.50 with classifier-free guidance, with a large gap over the matched velocity-prediction baseline. These results suggest that prediction targets in latent diffusion are representation-dependent geometric choices rather than interchangeable algebraic parameterizations.
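
The two technical claims above (linear convertibility at a fixed time, and opposite behaviour along low-variance directions) can be illustrated with a minimal sketch. The abstract does not state the corruption path or time convention, so the code below assumes a linear path z_t = t·x + (1 − t)·ε with v = x − ε and a per-direction Gaussian model x ~ N(0, λ), ε ~ N(0, 1). The function names are hypothetical and are not taken from the JLT implementation.

```python
import numpy as np

def corrupt(x, eps, t):
    """Assumed linear path: z_t = t * x + (1 - t) * eps, with t in (0, 1)."""
    return t * x + (1.0 - t) * eps

def v_from_x(x_hat, z, t):
    """Convert a clean prediction into a velocity v = x - eps, given z and t."""
    return (x_hat - z) / (1.0 - t)

def x_from_v(v_hat, z, t):
    """Convert a velocity prediction back into a clean prediction."""
    return z + (1.0 - t) * v_hat

def optimal_gains(lam, t):
    """Per-direction linear gains of the MMSE predictors of x and v from z,
    for x ~ N(0, lam) and eps ~ N(0, 1) under the assumed path.
    In this model the v target has variance lam + 1, which is at least 1."""
    denom = t**2 * lam + (1.0 - t) ** 2   # Var(z)
    gain_x = t * lam / denom              # Cov(x, z) / Var(z)
    gain_v = (t * lam - (1.0 - t)) / denom  # Cov(v, z) / Var(z)
    return gain_x, gain_v

if __name__ == "__main__":
    # Compare gains for a high-variance and a low-variance direction.
    for lam in (1.0, 1e-4):
        gx, gv = optimal_gains(lam, t=0.9)
        print(f"lambda={lam:g}: gain_x={gx:.4f}, gain_v={gv:.4f}")
```

Under these assumptions, as λ approaches 0 the clean-prediction gain goes to 0 while the velocity gain approaches −1/(1 − t), which grows in magnitude as t approaches 1. This is one way to read the damping versus amplification contrast; the paper's own analysis may use a different convention.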