Next Visual Granularity Generation

Abstract

We propose a novel approach to image generation that decomposes an image into a structured sequence. Each element of the sequence shares the same spatial resolution but uses a different number of unique tokens, so that each element captures a different level of visual granularity. Images are generated with our newly introduced Next Visual Granularity (NVG) generation framework, which produces a visual granularity sequence starting from an empty image and refines it progressively and in a structured manner, from global layout to fine details. This iterative process encodes a hierarchical, layered representation that offers fine-grained control over generation across multiple granularity levels. We train a series of NVG models for class-conditional image generation on the ImageNet dataset and observe clear scaling behavior. NVG consistently achieves better FID scores than the VAR series (3.30 -> 3.03, 2.57 -> 2.44, 2.09 -> 2.06). We also conduct extensive analysis to showcase the capability and potential of the NVG framework. Our code and models will be released.
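The idea of a sequence whose elements keep one spatial resolution while varying the number of unique tokens can be illustrated with a toy sketch. The code below is not the NVG tokenizer or the authors' method. It assumes a grayscale image with values in [0, 1) and uses plain uniform intensity binning as a stand-in; the function name `granularity_sequence` and the stage sizes `(1, 2, 4, 8)` are hypothetical choices made only for illustration.

```python
import numpy as np


def granularity_sequence(image, token_counts=(1, 2, 4, 8)):
    """Toy stand-in: one map per stage, same shape, at most k distinct values."""
    image = np.asarray(image, dtype=np.float64)
    stages = []
    for k in token_counts:
        # Assign each pixel to one of k equal-width bins over [0, 1).
        bins = np.minimum((image * k).astype(int), k - 1)
        # Represent every bin by its centre value (the "token" for that bin).
        stages.append((bins + 0.5) / k)
    return stages


rng = np.random.default_rng(0)
image = rng.random((8, 8))  # hypothetical input, values in [0, 1)
stages = granularity_sequence(image)

for k, stage in zip((1, 2, 4, 8), stages):
    # Every stage keeps the 8x8 resolution; only the token budget changes.
    print(k, stage.shape, len(np.unique(stage)))
```

In this sketch the first stage is a single flat value, loosely analogous to starting from an empty image, and each later stage has the same shape but a larger token budget, so the approximation error can only shrink as the stages proceed from coarse to fine.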
