Efficient Image Synthesis with a Spherical Latent Encoder

Abstract

Few-step image generation has progressed rapidly, with consistency-based and MeanFlow-based methods significantly reducing the number of sampling steps. Despite their low inference cost, these approaches often suffer from training instability and limited scalability. Sphere Encoder is a recent alternative that produces high-quality images in only a few steps; however, it requires repeated transitions between pixel space and latent space during inference, and it jointly optimizes reconstruction and generation within a single architecture. This design leads to computational inefficiency and to a conflict between the reconstruction and generation objectives. To address these limitations, we decouple the framework into a fixed, pretrained image encoder and a separate latent denoising model trained entirely in a spherical latent space. Our approach eliminates repeated pixel-space operations during both training and inference, which improves efficiency and allows reconstruction and generation to specialize independently. On the Animal-Faces, Oxford-Flowers, and ImageNet-1K datasets, our method significantly outperforms Sphere Encoder in both generation quality and inference speed, while achieving competitive results against strong few-step and multi-step baselines.
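
The decoupled design described above can be illustrated with a minimal sketch. The abstract does not specify the sampling procedure, so everything here is an assumption made for illustration: the function names (`project_to_sphere`, `few_step_sample`, `toy_denoise`, `toy_decode`), the linear noise schedule, and the re-projection after each step are hypothetical and are not the authors' implementation. The only points the sketch is meant to show are that latents are kept on a hypersphere, that all sampling steps run in latent space, and that the decoder is called once at the end.

```python
import math
import random


def project_to_sphere(v, radius=1.0):
    """Rescale a vector so it lies on a hypersphere of the given radius."""
    norm = math.sqrt(sum(x * x for x in v))
    if norm == 0.0:
        raise ValueError("cannot project the zero vector onto a sphere")
    return [radius * x / norm for x in v]


def few_step_sample(denoise, decode, dim, steps=4, noise_scale=0.5, seed=0):
    """Hypothetical few-step sampler that stays in latent space.

    `denoise` maps a latent to a cleaner latent; `decode` maps a latent
    to an image. Both are placeholders for trained networks.
    """
    rng = random.Random(seed)
    # Start from a random point on the sphere (normalized Gaussian noise).
    z = project_to_sphere([rng.gauss(0.0, 1.0) for _ in range(dim)])
    for i in range(steps):
        # Denoise in latent space, then return to the sphere.
        z = project_to_sphere(denoise(z))
        if i < steps - 1:
            # Assumed schedule: re-inject a shrinking amount of noise.
            sigma = noise_scale * (1.0 - (i + 1) / steps)
            z = project_to_sphere([x + sigma * rng.gauss(0.0, 1.0) for x in z])
    # Pixel space is touched exactly once, after the last latent step.
    return decode(z)


def toy_denoise(z):
    """Stand-in for a trained denoiser: pull the latent toward a fixed direction."""
    target = [1.0] + [0.0] * (len(z) - 1)
    return [0.5 * a + 0.5 * b for a, b in zip(z, target)]


def toy_decode(z):
    """Stand-in for a decoder: identity, so the output is the final latent."""
    return list(z)


sample = few_step_sample(toy_denoise, toy_decode, dim=8, steps=4)
```

In this toy setup the cost of `decode` is paid once per sample regardless of `steps`, which is the efficiency argument the abstract makes against alternating between pixel space and latent space at every step.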