SNCE: Geometry-Aware Supervision for Scalable Discrete Image Generation

Abstract

Recent advances in discrete image generation have shown that scaling the VQ codebook size significantly improves reconstruction fidelity. However, training generative models with a large VQ codebook remains challenging and typically requires larger models and longer training schedules. In this work, we propose Stochastic Neighbor Cross Entropy Minimization (SNCE), a novel training objective designed to address the optimization challenges of large-codebook discrete image generators. Instead of supervising the model with a hard one-hot target, SNCE constructs a soft categorical distribution over a set of neighboring tokens. The probability assigned to each token is proportional to the proximity between its code embedding and the ground-truth image embedding, which encourages the model to capture semantically meaningful geometric structure in the quantized embedding space. We conduct extensive experiments across class-conditional ImageNet-256 generation, large-scale text-to-image synthesis, and image editing tasks. The results show that SNCE significantly improves convergence speed and overall generation quality compared to the standard cross-entropy objective.
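
The soft target described in the abstract can be illustrated with a minimal sketch. The abstract states only that each neighboring token's probability is proportional to the proximity between its code embedding and the ground-truth image embedding. The specific choices below are assumptions made for illustration and are not taken from the paper: squared Euclidean distance as the proximity measure, the k nearest codes as the neighbor set, a temperature tau, and the hypothetical function names snce_soft_target and soft_cross_entropy.

```python
import math

def snce_soft_target(z, codebook, k=4, tau=1.0):
    """Soft categorical target over the k codes nearest to embedding z.

    Assumptions (not from the paper): squared Euclidean distance,
    k-nearest-neighbor set, softmax with temperature tau.
    """
    # Squared distance from the continuous embedding z to every code.
    d2 = [sum((a - b) ** 2 for a, b in zip(z, c)) for c in codebook]
    # Indices of the k nearest codes, closest first.
    nbrs = sorted(range(len(codebook)), key=lambda i: d2[i])[:k]
    # Softmax over negative distances, shifted by the minimum for stability.
    d_min = d2[nbrs[0]]
    weights = {i: math.exp(-(d2[i] - d_min) / tau) for i in nbrs}
    total = sum(weights.values())
    # Codes outside the neighbor set receive zero probability.
    target = [0.0] * len(codebook)
    for i, w in weights.items():
        target[i] = w / total
    return target

def soft_cross_entropy(logits, target):
    """Cross entropy between a soft target and softmax(logits)."""
    m = max(logits)
    log_norm = m + math.log(sum(math.exp(l - m) for l in logits))
    return -sum(t * (l - log_norm) for t, l in zip(target, logits) if t > 0.0)

# Toy 2-D codebook with four codes; z lies closest to code 0.
codebook = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [5.0, 5.0]]
z = [0.1, 0.0]
target = snce_soft_target(z, codebook, k=2, tau=0.5)
loss = soft_cross_entropy([0.0, 0.0, 0.0, 0.0], target)
```

In this sketch, setting k=1 collapses the target to a one-hot vector on the nearest code, so the loss reduces to the standard cross-entropy objective that the abstract uses as its baseline.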
