JLT: Clean-Latent Prediction in Latent Diffusion Transformers

Abstract

Flow matching with clean-data prediction has shown that regressing on the clean point can exploit low-dimensional structure more effectively than predicting an ambient noised quantity. We ask whether this principle remains useful after images are mapped into a learned latent space, where compression has already removed much of the raw pixel variability. We introduce JLT, a 130M-parameter latent diffusion Transformer over frozen FLUX.2 VAE codes, and compare clean-latent prediction with a matched velocity-prediction DiT under the same representation, backbone, and training settings. Although the three variables x, epsilon, and v are linearly convertible at any fixed corruption time, a local Gaussian analysis shows that velocity regression inherits an isotropic target-covariance floor and amplifies low-variance latent directions, whereas clean prediction damps them. On ImageNet 256×256, JLT-B/1 obtains an FID-50K of 2.50 with classifier-free guidance, with a large gap over the matched velocity-prediction baseline. These results suggest that prediction targets in latent diffusion are representation-dependent geometric choices rather than interchangeable algebraic parameterizations.
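
As an illustration of the claim about x, epsilon, and v, the following is a minimal sketch and not the paper's own analysis. It assumes a linear interpolation path z_t = t x + (1 - t) epsilon with velocity v = x - epsilon, and a diagonal Gaussian latent with per-direction variance `lam`; the path convention and all function names are assumptions introduced here. Under these assumptions the targets convert linearly into one another, yet the per-direction target variance of v never falls below 1, and the optimal linear predictor for v keeps a large gain in low-variance directions while the one for x shrinks toward zero.

```python
import numpy as np

# Assumed convention (not stated in the abstract):
#   z_t = t * x + (1 - t) * eps,   v = x - eps,   eps ~ N(0, I)

def interpolate(x, eps, t):
    """Noised latent on the assumed linear path."""
    return t * x + (1.0 - t) * eps

def x_from_v(z, v, t):
    """Recover the clean latent from a velocity target (linear conversion)."""
    return z + (1.0 - t) * v

def eps_from_v(z, v, t):
    """Recover the noise from a velocity target (linear conversion)."""
    return z - t * v

def target_variances(lam):
    """Per-direction variance of each regression target when x ~ N(0, diag(lam))."""
    lam = np.asarray(lam, dtype=float)
    return {"x": lam, "eps": np.ones_like(lam), "v": lam + 1.0}

def optimal_gains(lam, t):
    """Per-direction gain g of the optimal linear predictor g * z_t for x and for v."""
    lam = np.asarray(lam, dtype=float)
    var_z = t**2 * lam + (1.0 - t) ** 2       # Var(z_t) per direction
    gain_x = t * lam / var_z                  # Cov(x, z_t) / Var(z_t)
    gain_v = (t * lam - (1.0 - t)) / var_z    # Cov(v, z_t) / Var(z_t)
    return gain_x, gain_v

if __name__ == "__main__":
    lam = np.array([1.0, 1e-2, 1e-4])         # high- to low-variance directions
    print("target variances:", target_variances(lam))
    print("gains (x, v) at t=0.5:", optimal_gains(lam, 0.5))
```

In this toy setting, as `lam` approaches zero the x gain vanishes while the v gain approaches -1 / (1 - t), which is one way to read the "floor" and "amplification" described above.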