Learning on the Manifold: Unlocking Standard Diffusion Transformers with Representation Encoders

Abstract

Leveraging representation encoders for generative modeling offers a path to efficient, high-fidelity synthesis. However, standard diffusion transformers fail to converge when trained directly on these representations. Recent work attributes this to a capacity bottleneck and proposes computationally expensive width scaling of diffusion transformers; we demonstrate instead that the failure is fundamentally geometric. We identify Geometric Interference as the root cause: standard Euclidean flow matching forces probability paths through the low-density interior of the hyperspherical feature space of representation encoders, rather than following the manifold surface. To resolve this, we propose Riemannian Flow Matching with Jacobi Regularization (RJF). By constraining the generative process to the manifold geodesics and correcting for curvature-induced error propagation, RJF enables standard Diffusion Transformer architectures to converge without width scaling. With RJF, the standard DiT-B architecture (131M parameters) converges effectively, achieving an FID of 3.37 where prior methods fail to converge. Code: https://github.com/amandpkr/RJF
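
The geometric point in the abstract can be illustrated with a small numerical sketch. It is not the authors' RJF implementation and does not cover the Jacobi regularization; it only compares a straight-line (Euclidean) path with a great-circle (geodesic) path between two points on a unit hypersphere. The feature dimension of 768, the random unit vectors standing in for noise and encoder features, and the helper names `lerp`, `slerp`, and `unit` are all assumptions made for illustration.

```python
import numpy as np

def unit(v):
    """Project a vector onto the unit hypersphere."""
    return v / np.linalg.norm(v)

def lerp(x0, x1, t):
    """Straight-line (Euclidean) interpolation between x0 and x1."""
    return (1.0 - t) * x0 + t * x1

def slerp(x0, x1, t):
    """Great-circle (geodesic) interpolation between unit vectors x0 and x1."""
    cos_omega = np.clip(np.dot(x0, x1), -1.0, 1.0)
    omega = np.arccos(cos_omega)  # angle between the two endpoints
    if omega < 1e-8:              # endpoints coincide: nothing to interpolate
        return x0.copy()
    return (np.sin((1.0 - t) * omega) * x0 + np.sin(t * omega) * x1) / np.sin(omega)

rng = np.random.default_rng(0)
dim = 768  # hypothetical feature dimension, chosen only for illustration
x0 = unit(rng.standard_normal(dim))  # stand-in for a normalized noise sample
x1 = unit(rng.standard_normal(dim))  # stand-in for a normalized encoder feature

# Norm of the midpoint of each path: the sphere surface has norm 1.
mid_lerp_norm = float(np.linalg.norm(lerp(x0, x1, 0.5)))
mid_slerp_norm = float(np.linalg.norm(slerp(x0, x1, 0.5)))

print(f"Euclidean midpoint norm: {mid_lerp_norm:.3f}")
print(f"Geodesic midpoint norm:  {mid_slerp_norm:.3f}")
```

Under these assumptions the Euclidean midpoint has norm well below 1, meaning the straight path cuts through the interior of the sphere, while the geodesic midpoint stays on the surface.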
