Geometric Autoencoder for Diffusion Models

Abstract

Latent diffusion models have established a new state of the art in high-resolution visual generation. Integrating Vision Foundation Model (VFM) priors improves generative efficiency, yet existing latent designs remain largely heuristic. These approaches often struggle to unify semantic discriminability, reconstruction fidelity, and latent compactness. In this paper, we propose Geometric Autoencoder (GAE), a principled framework that systematically addresses these challenges. By analyzing various alignment paradigms, GAE constructs an optimized low-dimensional semantic supervision target from VFMs to guide the autoencoder. Furthermore, we replace the restrictive KL-divergence of standard VAEs with latent normalization, enabling a more stable latent manifold specifically optimized for diffusion learning. To ensure robust reconstruction under high-intensity noise, GAE incorporates a dynamic noise sampling mechanism. Empirically, GAE achieves compelling performance on the ImageNet-1K 256×256 benchmark, reaching a gFID of 1.82 at only 80 epochs and 1.31 at 800 epochs without Classifier-Free Guidance, significantly surpassing existing state-of-the-art methods. Beyond generative quality, GAE establishes a superior equilibrium between compression, semantic depth, and robust reconstruction stability. These results validate our design considerations, offering a promising paradigm for latent diffusion modeling. Code and models are publicly available at https://github.com/freezing-index/Geometric-Autoencoder-for-Diffusion-Models.
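The abstract names latent normalization and dynamic noise sampling but does not give their exact form. The sketch below is only one plausible reading, not the authors' implementation: it assumes the normalization rescales each latent to unit root-mean-square magnitude, and that the noise strength is drawn uniformly per sample. The function names `rms_normalize` and `add_dynamic_noise`, the `max_sigma` parameter, and the uniform schedule are all hypothetical.

```python
import math
import random


def rms_normalize(latent, eps=1e-8):
    """Rescale a latent vector to (approximately) unit root-mean-square magnitude.

    Assumption: stands in for the paper's "latent normalization"; the
    actual normalization used by GAE is not specified in the abstract.
    """
    rms = math.sqrt(sum(x * x for x in latent) / len(latent))
    return [x / (rms + eps) for x in latent]


def add_dynamic_noise(latent, rng, max_sigma=1.0):
    """Perturb a latent with Gaussian noise whose strength is drawn per call.

    Assumption: sigma ~ Uniform(0, max_sigma) is a hypothetical schedule
    chosen for illustration only.
    """
    sigma = rng.uniform(0.0, max_sigma)
    noisy = [x + sigma * rng.gauss(0.0, 1.0) for x in latent]
    return noisy, sigma


# Toy usage: normalize a 4-dimensional latent, then perturb it.
rng = random.Random(0)
z = rms_normalize([3.0, -4.0, 0.0, 5.0])
z_noisy, sigma = add_dynamic_noise(z, rng)
```

Under these assumptions, a decoder trained to reconstruct from `z_noisy` rather than `z` sees a range of noise strengths during training, which is the kind of robustness to high-intensity noise the abstract describes.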
