Efficient Image Synthesis with a Spherical Latent Encoder

Abstract

Few-step image generation has progressed rapidly, with consistency-based and meanflow-based methods significantly reducing the number of sampling steps. Despite their low inference cost, these approaches often suffer from training instability and limited scalability. Sphere Encoder is a recent alternative that produces high-quality images in only a few steps. However, it requires repeated transitions between pixel space and latent space during inference, and it jointly optimizes reconstruction and generation within a single architecture. This design is computationally inefficient and creates a conflict between the reconstruction and generation objectives. To address these limitations, we decouple the framework into a fixed pretrained image encoder and a separate latent denoising model trained entirely in a spherical latent space. Our approach eliminates repeated pixel-space operations during training and inference, which improves efficiency and lets reconstruction and generation specialize independently. On the Animal-Faces, Oxford-Flowers, and ImageNet-1K datasets, our method significantly outperforms Sphere Encoder in both generation quality and inference speed, while achieving competitive results against strong few-step and multi-step baselines.
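
The decoupled design can be illustrated with a toy sketch of few-step sampling that stays entirely on a spherical latent space. The abstract does not specify the noise schedule, the denoiser architecture, or the sampler, so everything below is an assumption made for illustration: the function names (`project_to_sphere`, `perturb_on_sphere`, `sample_latent`), the `sigmas` schedule, the unit radius, and the stand-in `toy_denoiser` are all hypothetical and are not the authors' method.

```python
import math
import random


def project_to_sphere(v, radius=1.0):
    """Rescale vector v so that its L2 norm equals radius."""
    norm = math.sqrt(sum(x * x for x in v))
    if norm == 0.0:
        raise ValueError("cannot project the zero vector onto a sphere")
    return [radius * x / norm for x in v]


def perturb_on_sphere(z, sigma, rng, radius=1.0):
    """Add isotropic Gaussian noise to z, then project back onto the sphere."""
    noisy = [x + sigma * rng.gauss(0.0, 1.0) for x in z]
    return project_to_sphere(noisy, radius)


def sample_latent(denoise, dim, sigmas, rng, radius=1.0):
    """Few-step sampling that never leaves the latent space.

    `denoise(z, sigma)` is a hypothetical latent denoiser; `sigmas` is an
    assumed decreasing noise schedule with one entry per sampling step.
    """
    # Start from a uniformly random direction on the sphere (pure noise).
    z = project_to_sphere([rng.gauss(0.0, 1.0) for _ in range(dim)], radius)
    for i, sigma in enumerate(sigmas):
        if i > 0:
            # Re-noise the current estimate before refining it again.
            z = perturb_on_sphere(z, sigma, rng, radius)
        # Denoise in latent space and keep the result on the sphere.
        z = project_to_sphere(denoise(z, sigma), radius)
    return z


def toy_denoiser(z, sigma):
    """Stand-in for a trained model: pulls z halfway toward a fixed direction."""
    target = [1.0] + [0.0] * (len(z) - 1)
    return [0.5 * a + 0.5 * b for a, b in zip(z, target)]


rng = random.Random(0)
z = sample_latent(toy_denoiser, dim=8, sigmas=[1.0, 0.5, 0.25], rng=rng)
```

In a setup like this, a pretrained decoder would be called once on the final `z` to produce an image, so the loop itself involves no pixel-space operations. That is the efficiency argument the abstract makes, shown here only schematically.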