Efficient Image Synthesis with a Spherical Latent Encoder

Abstract

Few-step image generation has progressed rapidly, with consistency models and MeanFlow-based methods sharply reducing the number of sampling steps. Despite their low inference cost, these approaches often suffer from training instability and limited scalability. Sphere Encoder is a recent alternative that produces high-quality images in only a few steps. However, it requires repeated transitions between pixel space and latent space during inference, and it jointly optimizes reconstruction and generation within a single architecture. This design leads to computational inefficiency and to conflicting objectives between reconstruction and generation. To address these limitations, we decouple the framework into a fixed, pretrained image encoder and a separate latent denoising model trained entirely in a spherical latent space. Our approach eliminates repeated pixel-space operations during both training and inference, which improves efficiency and allows reconstruction and generation to specialize independently. On the Animal-Faces, Oxford-Flowers, and ImageNet-1K datasets, our method significantly outperforms Sphere Encoder in both generation quality and inference speed, while achieving competitive results against strong few-step and multi-step baselines.
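To make the decoupled design concrete, the following is a minimal illustrative sketch, not the authors' implementation. It assumes that "spherical latent space" means latents rescaled to a fixed L2 norm, and it replaces the learned denoiser with a hypothetical stand-in (`toy_denoiser`) that simply pulls the latent toward a known target. All function names and parameters here are assumptions introduced for illustration. The point it shows is structural: every refinement step stays in latent space, and decoding to pixels would happen only once, after the loop.

```python
import math
import random


def project_to_sphere(v, radius=1.0):
    """Rescale v so that its L2 norm equals `radius` (assumed sphere constraint)."""
    norm = math.sqrt(sum(x * x for x in v))
    if norm == 0.0:
        raise ValueError("cannot project the zero vector onto a sphere")
    return [radius * x / norm for x in v]


def toy_denoiser(z, target, strength=0.5):
    """Hypothetical stand-in for a learned latent denoiser.

    It linearly interpolates z toward `target`; a real model would predict
    the update from z alone (plus conditioning), without access to the target.
    """
    return [(1.0 - strength) * a + strength * b for a, b in zip(z, target)]


def sample_latent(target, steps=4, seed=0):
    """Few-step sampling that never leaves latent space.

    Starts from Gaussian noise projected onto the sphere, then alternates a
    denoising update with re-projection onto the sphere.
    """
    rng = random.Random(seed)
    z = project_to_sphere([rng.gauss(0.0, 1.0) for _ in target])
    for _ in range(steps):
        z = project_to_sphere(toy_denoiser(z, target))
    return z  # a decoder would map this to pixels once, outside the loop


if __name__ == "__main__":
    target = project_to_sphere([1.0, 0.0, 0.0, 0.0])
    z = sample_latent(target, steps=4)
    cosine = sum(a * b for a, b in zip(z, target))
    print("latent norm:", math.sqrt(sum(x * x for x in z)))
    print("cosine similarity to target:", cosine)
```

In this toy setting each step halves the angle between the current latent and the target, so a handful of steps is enough to land close to it. That behavior is a property of the stand-in denoiser only and says nothing about the convergence of the actual model.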