Efficient Image Synthesis with Sphere Latent Encoder
May 15, 2026
Authors: Tung Do, Thuan Hoang Nguyen, Hao Li
cs.AI
Abstract
Few-step image generation has progressed rapidly, with consistency-based and MeanFlow-based methods significantly reducing the number of sampling steps. Despite their low inference cost, these approaches often suffer from training instability and limited scalability. Sphere Encoder is a recent alternative that produces high-quality images in only a few steps; however, it requires repeated transitions between pixel space and latent space during inference, and it jointly optimizes reconstruction and generation within a single architecture. This design leads to computational inefficiency and to a conflict between the reconstruction and generation objectives. To address these limitations, we decouple the framework into a fixed pretrained image encoder and a separate latent denoising model trained entirely in a spherical latent space. Our approach eliminates repeated pixel-space operations during training and inference, which improves efficiency and allows reconstruction and generation to specialize independently. On the Animal-Faces, Oxford-Flowers, and ImageNet-1K datasets, our method significantly outperforms Sphere Encoder in both generation quality and inference speed, while achieving competitive results against strong few-step and multi-step baselines.
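To make the decoupled design concrete, the following is a minimal sketch of few-step sampling that stays entirely in a spherical latent space. It is not the authors' implementation: the function names, the sphere radius of sqrt(dim), the Gaussian re-noising rule, and the `denoiser(z, sigma)` signature are all assumptions chosen for illustration. The abstract only states that the denoiser operates in a spherical latent space and that pixel-space operations are not repeated.

```python
import numpy as np

def project_to_sphere(z, radius=None):
    """Rescale each latent vector onto a sphere (assumed radius: sqrt(dim))."""
    z = np.asarray(z, dtype=np.float64)
    r = np.sqrt(z.shape[-1]) if radius is None else radius
    norm = np.linalg.norm(z, axis=-1, keepdims=True)
    return r * z / np.maximum(norm, 1e-12)

def noisy_sphere_latent(z, sigma, rng):
    """Perturb a spherical latent with Gaussian noise, then project back.

    This noising rule is an illustrative assumption, not the paper's.
    """
    return project_to_sphere(z + sigma * rng.standard_normal(z.shape))

def sample_latents(denoiser, batch, dim, sigmas, rng):
    """Few-step sampling loop that never leaves latent space.

    `denoiser` is a hypothetical callable (z, sigma) -> cleaner z,
    standing in for the separately trained latent denoising model.
    """
    # Start from random points on the sphere.
    z = project_to_sphere(rng.standard_normal((batch, dim)))
    for i, sigma in enumerate(sigmas):
        z = project_to_sphere(denoiser(z, sigma))
        if i + 1 < len(sigmas):
            # Re-noise at the next (smaller) level before the next step.
            z = noisy_sphere_latent(z, sigmas[i + 1], rng)
    # A fixed pretrained decoder would map z to pixels once, after the loop.
    return z
```

With a trained model in place of `denoiser`, the only pixel-space operation would be a single decode of the returned latents, which is the efficiency argument the abstract makes against alternating between pixel space and latent space at every step.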