SNCE: Geometry-Aware Supervision for Scalable Discrete Image Generation

Abstract

Recent advances in discrete image generation have shown that scaling the VQ codebook size significantly improves reconstruction fidelity. However, training generative models with a large VQ codebook remains challenging and typically requires larger models and longer training schedules. In this work, we propose Stochastic Neighbor Cross Entropy Minimization (SNCE), a novel training objective designed to address the optimization challenges of large-codebook discrete image generators. Instead of supervising the model with a hard one-hot target, SNCE constructs a soft categorical distribution over a set of neighboring tokens. The probability assigned to each token is proportional to the proximity between its code embedding and the ground-truth image embedding, which encourages the model to capture semantically meaningful geometric structure in the quantized embedding space. We conduct extensive experiments across class-conditional ImageNet-256 generation, large-scale text-to-image synthesis, and image editing tasks. The results show that SNCE significantly improves convergence speed and overall generation quality compared with the standard cross-entropy objective.
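To make the idea of a soft target over neighboring tokens concrete, the following is a minimal sketch and not the authors' implementation. The abstract does not specify the proximity measure, the neighborhood size, or any temperature, so the squared Euclidean distance, the top-k neighborhood, the softmax weighting, and the names `snce_soft_target` and `soft_cross_entropy` are all illustrative assumptions. The "ground-truth image embedding" is taken here to be the continuous embedding before quantization, which is also an assumption.

```python
import math


def snce_soft_target(codebook, z, k=3, temperature=1.0):
    """Build a soft target over the k codes nearest to embedding z.

    Assumptions (not from the source): squared Euclidean distance as the
    proximity measure, softmax weighting with a temperature, and a top-k
    neighborhood. Returns a list of length len(codebook) that sums to 1.
    """
    # Squared Euclidean distance from z to every code embedding.
    dists = [sum((c - x) ** 2 for c, x in zip(code, z)) for code in codebook]
    # Indices of the k nearest codes form the neighborhood.
    neighbors = sorted(range(len(codebook)), key=lambda i: dists[i])[:k]
    # Softmax over negative distances, shifted for numerical stability.
    scores = [-dists[i] / temperature for i in neighbors]
    top = max(scores)
    weights = [math.exp(s - top) for s in scores]
    total = sum(weights)
    target = [0.0] * len(codebook)
    for i, w in zip(neighbors, weights):
        target[i] = w / total
    return target


def soft_cross_entropy(logits, target):
    """Cross entropy between a soft target and softmax(logits)."""
    top = max(logits)
    log_norm = top + math.log(sum(math.exp(l - top) for l in logits))
    return -sum(t * (l - log_norm) for t, l in zip(target, logits) if t > 0.0)


# Toy example: four 2-D codes and a continuous embedding close to code 0.
codebook = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [5.0, 5.0]]
z = [0.1, 0.0]
target = snce_soft_target(codebook, z, k=3, temperature=1.0)
loss = soft_cross_entropy([2.0, 1.0, 0.5, -1.0], target)
```

In this sketch, setting `k=1` recovers the usual one-hot target, so the standard cross-entropy objective appears as a special case. Codes outside the neighborhood receive zero probability, while closer codes inside it receive more mass.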
