Efficient Image Synthesis with a Spherical Latent Encoder

Abstract

Few-step image generation has progressed rapidly, with consistency-based and MeanFlow-based methods significantly reducing the number of sampling steps. Despite their low inference cost, these approaches often suffer from training instability and limited scalability. Sphere Encoder is a recent alternative that produces high-quality images in only a few steps; however, it requires repeated transitions between pixel space and latent space during inference, and it jointly optimizes reconstruction and generation within a single architecture. This design leads to computational inefficiency and to conflicting objectives between reconstruction and generation. To address these limitations, we decouple the framework into a fixed pretrained image encoder and a separate latent denoising model trained entirely in a spherical latent space. Our approach eliminates repeated pixel-space operations during both training and inference, which improves efficiency and allows reconstruction and generation to specialize independently. On the Animal-Faces, Oxford-Flowers, and ImageNet-1K datasets, our method significantly outperforms Sphere Encoder in both generation quality and inference speed, while achieving competitive results against strong few-step and multi-step baselines.
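
To make the decoupled design concrete, the following is a minimal sketch of few-step sampling that stays entirely in a spherical latent space and decodes to pixels only once at the end. It is an illustration under stated assumptions, not the authors' implementation: the names `project_to_sphere`, `sample_latents`, `toy_denoiser`, and `toy_decoder` are hypothetical, the unit-radius sphere and the four-step loop are arbitrary choices, and the toy denoiser and decoder are simple stand-ins for trained networks.

```python
import numpy as np

def project_to_sphere(z, radius=1.0):
    """Rescale each latent vector so it lies on a sphere of the given radius."""
    norm = np.linalg.norm(z, axis=-1, keepdims=True)
    return radius * z / np.maximum(norm, 1e-12)

def toy_denoiser(z, step):
    # Hypothetical stand-in for a trained latent denoiser:
    # moves each latent halfway toward a fixed direction.
    target = np.zeros(z.shape[-1])
    target[0] = 1.0
    return z + 0.5 * (target - z)

def toy_decoder(z, image_dim=48, seed=1):
    # Hypothetical stand-in for a fixed, pretrained decoder:
    # a fixed random linear map from latents to flattened "pixels".
    rng = np.random.default_rng(seed)
    weights = rng.standard_normal((z.shape[-1], image_dim))
    return z @ weights

def sample_latents(denoiser, num_samples, dim, num_steps=4, seed=0):
    """Few-step sampling carried out only in latent space (assumed scheme)."""
    rng = np.random.default_rng(seed)
    # Start from Gaussian noise projected onto the sphere.
    z = project_to_sphere(rng.standard_normal((num_samples, dim)))
    for step in range(num_steps):
        # Each step denoises in latent space and re-projects onto the sphere;
        # no pixel-space encode/decode happens inside the loop.
        z = project_to_sphere(denoiser(z, step))
    return z

latents = sample_latents(toy_denoiser, num_samples=8, dim=16)
images = toy_decoder(latents)  # single decode after the latent loop
```

In this sketch the decoder is called once per batch regardless of `num_steps`, which is the efficiency argument made above; a design that alternates between pixel and latent space would instead pay an encode and decode cost at every step.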