JLT: Clean-Latent Prediction in Latent Diffusion Transformers

May 26, 2026
Authors: Funing Fu, Tenghui Wang, Junyong Cen, Qichao Zhu, Guanyu Zhou
cs.AI

Abstract

Flow matching with clean-data prediction has shown that regressing the clean point can exploit low-dimensional structure more effectively than predicting an ambient noised quantity. We ask whether this principle remains useful after images are mapped into a learned latent space, where compression has already removed much of the raw pixel variability. We introduce JLT, a 130M-parameter latent diffusion Transformer over frozen FLUX.2 VAE codes, and compare clean-latent prediction with a matched velocity-prediction DiT under the same representation, backbone, and training settings. Although the three variables x, ε, and v are linearly convertible for a fixed corruption time, a local Gaussian analysis shows that velocity regression inherits an isotropic target-covariance floor and amplifies low-variance latent directions, while clean prediction damps them. On ImageNet 256×256, JLT-B/1 obtains FID-50K 2.50 with classifier-free guidance, with a large matched-target gap over velocity prediction. These results suggest that prediction targets in latent diffusion are representation-dependent geometric choices, rather than interchangeable algebraic parameterizations.
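
The linear convertibility of x, ε, and v, and the contrast between the two regression targets, can be illustrated with a small numerical sketch. This is not the paper's code or derivation. It assumes a linear interpolation path z_t = t·x + (1 − t)·ε with velocity v = x − ε, a convention the abstract does not state, and a toy diagonal Gaussian latent whose per-direction variances `lam` are hypothetical. Under those assumptions the velocity target has variance lam + 1 in every direction (a floor of 1 from the noise), the optimal linear clean predictor has a gain that vanishes as lam goes to 0, and the optimal velocity predictor has a gain that approaches −1/(1 − t).

```python
import numpy as np

# Assumed convention (not stated in the abstract):
#   z_t = t * x + (1 - t) * eps,   v = x - eps,   0 <= t < 1

def x_from_v(z_t, v, t):
    """Recover the clean latent from a velocity: x = z_t + (1 - t) * v."""
    return z_t + (1.0 - t) * v

def v_from_x(z_t, x, t):
    """Recover the velocity from a clean latent: v = (x - z_t) / (1 - t)."""
    return (x - z_t) / (1.0 - t)

def eps_from_x(z_t, x, t):
    """Recover the noise from a clean latent: eps = (z_t - t * x) / (1 - t)."""
    return (z_t - t * x) / (1.0 - t)

def optimal_gains(lam, t):
    """Per-direction gains of the MSE-optimal linear predictors from z_t,
    for x ~ N(0, lam) and eps ~ N(0, 1) independent (toy Gaussian model)."""
    x_gain = t * lam / (t**2 * lam + (1.0 - t) ** 2)  # E[x | z_t] = x_gain * z_t
    v_gain = (x_gain - 1.0) / (1.0 - t)               # E[v | z_t] = v_gain * z_t
    return x_gain, v_gain

rng = np.random.default_rng(0)
t = 0.8
lam = np.array([1.0, 1e-2, 1e-4])  # hypothetical per-direction latent variances

x = rng.normal(size=(100_000, 3)) * np.sqrt(lam)
eps = rng.normal(size=x.shape)
z_t = t * x + (1.0 - t) * eps
v = x - eps

# At a fixed t the three parameterizations convert into each other exactly.
x_rec = x_from_v(z_t, v, t)
v_rec = v_from_x(z_t, x, t)
eps_rec = eps_from_x(z_t, x, t)

# Empirical target variances: x follows lam, v stays near lam + 1.
x_target_var = x.var(axis=0)
v_target_var = v.var(axis=0)

x_gain, v_gain = optimal_gains(lam, t)
print("x target variance:", x_target_var)
print("v target variance:", v_target_var)
print("clean-prediction gain:", x_gain)
print("velocity-prediction gain:", v_gain)
```

In this toy setting the algebraic conversions are exact, yet the regression problems differ: a low-variance direction contributes almost nothing to the clean target but still carries a full unit of noise variance in the velocity target.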