JLT: Clean-Latent Prediction in Latent Diffusion Transformers

May 26, 2026
Authors: Funing Fu, Tenghui Wang, Junyong Cen, Qichao Zhu, Guanyu Zhou
cs.AI

Abstract

Flow matching with clean-data prediction has shown that regressing the clean point can exploit low-dimensional structure more effectively than predicting an ambient noised quantity. We ask whether this principle remains useful after images are mapped into a learned latent space, where compression has already removed much of the raw pixel variability. We introduce JLT, a 130M latent diffusion Transformer over frozen FLUX.2 VAE codes, and compare clean-latent prediction with a matched velocity-prediction DiT under the same representation, backbone, and training settings. Although the three variables x, epsilon, and v are linearly convertible for a fixed corruption time, a local Gaussian analysis shows that velocity regression inherits an isotropic target-covariance floor and amplifies low-variance latent directions, while clean prediction damps them. On ImageNet 256 x 256, JLT-B/1 obtains FID-50K 2.50 with classifier-free guidance, with a large matched-target gap over velocity prediction. These results suggest that prediction targets in latent diffusion are representation-dependent geometric choices, rather than interchangeable algebraic parameterizations.
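
To make the "linearly convertible" claim and the damping/amplification contrast concrete, here is a minimal sketch. It rests on two assumptions that the abstract does not state: the corruption path is the common linear interpolation z_t = t·x + (1 − t)·ε with velocity v = x − ε, and the latent is locally a zero-mean Gaussian with per-direction variance `sigma2`. The function names are hypothetical and are not taken from the JLT code.

```python
import numpy as np

# ASSUMPTION (not confirmed by the abstract): linear interpolation path
#   z_t = t * x + (1 - t) * eps,   v = x - eps,   0 <= t < 1.

def x_from_v(z_t, v, t):
    """Clean latent implied by a velocity prediction at time t."""
    return z_t + (1.0 - t) * v

def v_from_x(z_t, x, t):
    """Velocity implied by a clean-latent prediction at time t."""
    return (x - z_t) / (1.0 - t)

def eps_from_x(z_t, x, t):
    """Noise implied by a clean-latent prediction at time t."""
    return (z_t - t * x) / (1.0 - t)

def optimal_gains(sigma2, t):
    """Per-direction gains of the MSE-optimal linear predictors from z_t.

    Toy model (an assumption): x ~ N(0, sigma2), eps ~ N(0, 1), independent.
    Then E[x | z_t] = g_x * z_t and E[v | z_t] = g_v * z_t, where each gain
    is Cov(target, z_t) / Var(z_t).
    """
    var_z = t**2 * sigma2 + (1.0 - t) ** 2
    g_x = t * sigma2 / var_z
    g_v = (t * sigma2 - (1.0 - t)) / var_z
    return g_x, g_v

def velocity_target_variance(sigma2):
    """Var(v) = Var(x) + Var(eps) = sigma2 + 1, so it never drops below 1."""
    return sigma2 + 1.0
```

In this toy setting, a direction with very small `sigma2` gives a clean-prediction gain near zero, while the velocity gain approaches −1/(1 − t), which grows in magnitude as t approaches 1. The velocity target also keeps a variance of at least 1 in every direction regardless of `sigma2`. This mirrors the abstract's qualitative statement only; the paper's own analysis may use a different parameterization.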