Next Visual Granularity Generation

Abstract

We propose a novel approach to image generation that decomposes an image into a structured sequence. Every element of the sequence shares the same spatial resolution but uses a different number of unique tokens, so each element captures a different level of visual granularity. Images are generated with our newly introduced Next Visual Granularity (NVG) generation framework, which starts from an empty image and produces the visual granularity sequence in a structured manner, progressively refining it from global layout to fine details. This iterative process encodes a hierarchical, layered representation that offers fine-grained control over generation at multiple granularity levels. We train a series of NVG models for class-conditional image generation on the ImageNet dataset and observe clear scaling behavior. NVG consistently achieves better FID scores than the VAR series (3.30 -> 3.03, 2.57 -> 2.44, 2.09 -> 2.06). We also conduct extensive analysis to showcase the capability and potential of the NVG framework. Our code and models will be released.
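To make the idea of a sequence whose elements share one spatial resolution but differ in the number of unique tokens more concrete, here is a toy sketch. It is not the NVG tokenizer or model, which the abstract does not specify. The function name `granularity_sequence`, the token counts (1, 2, 4, 8), and the use of uniform intensity binning on a random grayscale array are all assumptions chosen purely for illustration.

```python
import numpy as np

def granularity_sequence(image, token_counts=(1, 2, 4, 8)):
    """Toy illustration (hypothetical, not the paper's method).

    Returns one integer index map per stage. Every map has the same
    shape as `image`; stage i uses at most token_counts[i] unique tokens.
    """
    image = np.asarray(image, dtype=float)
    lo, hi = image.min(), image.max()
    span = hi - lo if hi > lo else 1.0  # avoid division by zero on flat images
    normalized = (image - lo) / span    # values in [0, 1]

    stages = []
    for k in token_counts:
        # Uniform binning into k levels, giving indices in [0, k - 1].
        idx = np.floor(normalized * k).astype(int)
        idx = np.clip(idx, 0, k - 1)
        stages.append(idx)
    return stages

# Example: a random 8x8 "image" decomposed into four stages.
rng = np.random.default_rng(0)
img = rng.random((8, 8))
seq = granularity_sequence(img)

shapes = [s.shape for s in seq]                    # all stages share one resolution
unique_counts = [len(np.unique(s)) for s in seq]   # grows from coarse to fine
```

In this sketch the first stage assigns a single token to the whole image, and each later stage splits regions further without changing the map size. Because the token counts are powers of two, each coarser map can be recovered from a finer one by integer division of the indices, which loosely mirrors the coarse-to-fine refinement described above.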