Next Visual Granularity Generation
August 18, 2025
Authors: Yikai Wang, Zhouxia Wang, Zhonghua Wu, Qingyi Tao, Kang Liao, Chen Change Loy
cs.AI
Abstract
We propose a novel approach to image generation by decomposing an image into
a structured sequence, where each element in the sequence shares the same
spatial resolution but differs in the number of unique tokens used, capturing
different levels of visual granularity. Image generation is carried out through
our newly introduced Next Visual Granularity (NVG) generation framework, which
generates a visual granularity sequence beginning from an empty image and
progressively refines it, from global layout to fine details, in a structured
manner. This iterative process encodes a hierarchical, layered representation
that offers fine-grained control over the generation process across multiple
granularity levels. We train a series of NVG models for class-conditional image
generation on the ImageNet dataset and observe clear scaling behavior. Compared
to the VAR series, NVG consistently achieves lower (better) FID scores (VAR -> NVG:
3.30 -> 3.03, 2.57 -> 2.44, 2.09 -> 2.06). We also conduct extensive analysis to
showcase the capability and potential of the NVG framework. Our code and models
will be released.
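
To make the idea of a granularity sequence concrete, the following is a minimal toy sketch, not the authors' tokenizer or the NVG model. It assumes a small grayscale grid as the "image" and uses plain uniform intensity quantization as a stand-in for whatever decomposition the paper actually uses. The names `quantize`, `granularity_sequence`, and `unique_tokens`, and the token counts (1, 2, 4, 8), are hypothetical and chosen only for illustration. What it shows is the property stated in the abstract: every element of the sequence has the same spatial resolution, while the number of unique tokens grows from stage to stage.

```python
def quantize(image, num_tokens, max_value=255):
    """Assign each pixel one of `num_tokens` ids; the output keeps the input's H x W shape.

    Uniform intensity binning is an assumption made for this toy example only.
    """
    bin_width = (max_value + 1) / num_tokens
    return [
        [min(int(pixel / bin_width), num_tokens - 1) for pixel in row]
        for row in image
    ]


def granularity_sequence(image, token_counts=(1, 2, 4, 8)):
    """Build a coarse-to-fine list of token maps, one per entry in `token_counts`."""
    return [quantize(image, k) for k in token_counts]


def unique_tokens(token_map):
    """Count how many distinct token ids appear in a token map."""
    return len({token for row in token_map for token in row})


# Hypothetical 4 x 4 grayscale "image" with values in [0, 255].
image = [
    [0, 10, 200, 255],
    [5, 40, 180, 250],
    [90, 100, 130, 160],
    [60, 70, 140, 220],
]

token_counts = (1, 2, 4, 8)
stages = granularity_sequence(image, token_counts)

for k, stage in zip(token_counts, stages):
    height, width = len(stage), len(stage[0])
    print(f"budget={k} resolution={height}x{width} unique={unique_tokens(stage)}")
```

In this sketch the coarsest stage is a single token covering the whole grid (loosely analogous to starting from an empty image), and because the token budgets are powers of two, each finer map only subdivides the regions of the previous one. A real system would presumably learn both the region structure and the token content rather than bin raw intensities.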