Efficient Image Synthesis with a Sphere Latent Encoder

Abstract

Few-step image generation has progressed rapidly, with consistency-based and MeanFlow-based methods significantly reducing the number of sampling steps. Despite their low inference cost, these approaches often suffer from training instability and limited scalability. Sphere Encoder is a recent alternative that produces high-quality images in only a few steps; however, it requires repeated transitions between pixel space and latent space during inference, and it jointly optimizes reconstruction and generation within a single architecture. This design leads to computational inefficiency and to a conflict between the reconstruction and generation objectives. To address these limitations, we decouple the framework into a fixed pretrained image encoder and a separate latent denoising model trained entirely in a spherical latent space. Our approach eliminates repeated pixel-space operations during training and inference, which improves efficiency and allows reconstruction and generation to specialize independently. On the Animal-Faces, Oxford-Flowers, and ImageNet-1K datasets, our method significantly outperforms Sphere Encoder in both generation quality and inference speed, while achieving competitive results against strong few-step and multi-step baselines.
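
The idea of working entirely in a spherical latent space can be illustrated with a minimal sketch. This is not the authors' implementation: the function names `project_to_sphere` and `perturb_on_sphere`, the unit radius, the 8-dimensional latent, and the noise scale are all assumptions chosen for illustration. The sketch only shows that a latent vector can be normalized onto a sphere and that a noised vector can be projected back onto it, so each step stays in latent space without a round trip through pixel space.

```python
import math
import random


def project_to_sphere(v, radius=1.0):
    """Rescale v so that its Euclidean norm equals radius."""
    norm = math.sqrt(sum(x * x for x in v))
    if norm == 0.0:
        raise ValueError("cannot project the zero vector onto a sphere")
    return [radius * x / norm for x in v]


def perturb_on_sphere(z, sigma, rng, radius=1.0):
    """Add isotropic Gaussian noise to z, then project back onto the sphere."""
    noisy = [x + sigma * rng.gauss(0.0, 1.0) for x in z]
    return project_to_sphere(noisy, radius)


# Hypothetical stand-in for the output of a frozen image encoder.
rng = random.Random(0)
latent = project_to_sphere([rng.gauss(0.0, 1.0) for _ in range(8)])

# A noised latent that a separate denoising model would take as input.
noisy_latent = perturb_on_sphere(latent, sigma=0.5, rng=rng)
```

In a decoupled setup of the kind described above, a denoising model would be trained to map `noisy_latent` back toward `latent`, and a decoder would be applied once at the end of sampling; both of those components are omitted here.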