Learning on the Manifold: Unlocking Standard Diffusion Transformers with Representation Encoders

Abstract

Leveraging representation encoders for generative modeling offers a path to efficient, high-fidelity synthesis. However, standard diffusion transformers fail to converge when trained directly on these representations. Recent work attributes this to a capacity bottleneck and proposes computationally expensive width scaling of diffusion transformers; we demonstrate instead that the failure is fundamentally geometric. We identify Geometric Interference as the root cause: standard Euclidean flow matching forces probability paths through the low-density interior of the hyperspherical feature space of representation encoders, rather than along the manifold surface. To resolve this, we propose Riemannian Flow Matching with Jacobi Regularization (RJF). By constraining the generative process to the manifold's geodesics and correcting for curvature-induced error propagation, RJF enables standard Diffusion Transformer architectures to converge without width scaling. With RJF, the standard DiT-B architecture (131M parameters) converges effectively, achieving an FID of 3.37 where prior methods fail to converge. Code: https://github.com/amandpkr/RJF
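
The geometric point can be illustrated with a small, self-contained sketch. It is not the RJF implementation from the linked repository and does not include Jacobi regularization; it only compares a straight-line (Euclidean) interpolant with a great-circle (geodesic) interpolant between two unit vectors. The 3-dimensional toy endpoints and the helper names `lerp` and `slerp` are assumptions made for illustration, as is the premise that the encoder features lie on a unit hypersphere.

```python
import math


def norm(v):
    """Euclidean norm of a vector given as a list of floats."""
    return math.sqrt(sum(x * x for x in v))


def normalize(v):
    """Project a non-zero vector onto the unit sphere."""
    n = norm(v)
    return [x / n for x in v]


def lerp(x0, x1, t):
    """Straight-line interpolant, as used by Euclidean flow matching paths."""
    return [(1.0 - t) * a + t * b for a, b in zip(x0, x1)]


def slerp(x0, x1, t):
    """Great-circle (geodesic) interpolant between two unit vectors.

    Exactly antipodal endpoints are not handled in this sketch, because the
    geodesic between them is not unique.
    """
    dot = max(-1.0, min(1.0, sum(a * b for a, b in zip(x0, x1))))
    omega = math.acos(dot)  # angle between the two endpoints
    if omega < 1e-8:
        return list(x0)  # endpoints (nearly) coincide
    s = math.sin(omega)
    w0 = math.sin((1.0 - t) * omega) / s
    w1 = math.sin(t * omega) / s
    return [w0 * a + w1 * b for a, b in zip(x0, x1)]


# Hypothetical toy endpoints on the unit sphere, far apart but not antipodal.
x0 = normalize([1.0, 0.0, 0.0])
x1 = normalize([-0.8, 0.6, 0.0])

for t in (0.0, 0.25, 0.5, 0.75, 1.0):
    r_lin = norm(lerp(x0, x1, t))
    r_geo = norm(slerp(x0, x1, t))
    print(f"t={t:.2f}  |lerp|={r_lin:.3f}  |slerp|={r_geo:.3f}")
```

For these endpoints the straight-line midpoint has norm of about 0.32, well inside the sphere, while the geodesic interpolant keeps unit norm at every `t`. This is the sense in which a Euclidean path leaves the surface on which the features are assumed to concentrate.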

