

Understand Before You Generate: Self-Guided Training for Autoregressive Image Generation

September 18, 2025
Authors: Xiaoyu Yue, Zidong Wang, Yuqing Wang, Wenlong Zhang, Xihui Liu, Wanli Ouyang, Lei Bai, Luping Zhou
cs.AI

Abstract

Recent studies have demonstrated the importance of high-quality visual representations in image generation and have highlighted the limitations of generative models in image understanding. As a generative paradigm originally designed for natural language, autoregressive models face similar challenges. In this work, we present the first systematic investigation into the mechanisms of applying the next-token prediction paradigm to the visual domain. We identify three key properties that hinder the learning of high-level visual semantics: local and conditional dependence, inter-step semantic inconsistency, and spatial invariance deficiency. We show that these issues can be effectively addressed by introducing self-supervised objectives during training, leading to a novel training framework, Self-guided Training for AutoRegressive models (ST-AR). Without relying on pre-trained representation models, ST-AR significantly enhances the image understanding ability of autoregressive models and leads to improved generation quality. Specifically, ST-AR brings approximately 42% FID improvement for LlamaGen-L and 49% FID improvement for LlamaGen-XL, while maintaining the same sampling strategy.
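The abstract does not spell out the exact self-supervised objectives used in ST-AR, so the following is only a minimal sketch of the general idea it describes: augmenting the standard next-token prediction loss with an auxiliary self-supervised term that encourages semantically consistent, spatially invariant token features. The model interface (`model` returning logits and intermediate features), the augmented-view input `aug_tokens`, the cosine-alignment loss, and the weight `lambda_ssl` are all illustrative assumptions, not the paper's actual formulation.

```python
import torch
import torch.nn.functional as F

def training_step(model, tokens, aug_tokens, lambda_ssl=0.5):
    """Hypothetical sketch: next-token prediction plus a self-supervised
    feature-alignment term. `model(x)` is assumed to return
    (logits, features) with per-position intermediate features."""
    # Standard autoregressive objective: predict token t+1 from tokens <= t.
    logits, feats = model(tokens[:, :-1])
    ar_loss = F.cross_entropy(
        logits.reshape(-1, logits.size(-1)),
        tokens[:, 1:].reshape(-1),
    )

    # Illustrative self-supervised objective: align pooled features of the
    # original view with those of an augmented view of the same image
    # (re-tokenized), using a stop-gradient target. This is meant to push
    # the model toward features that are consistent across steps and
    # invariant to spatial perturbations.
    with torch.no_grad():
        _, target_feats = model(aug_tokens[:, :-1])
    ssl_loss = 1.0 - F.cosine_similarity(
        feats.mean(dim=1), target_feats.mean(dim=1), dim=-1
    ).mean()

    return ar_loss + lambda_ssl * ssl_loss
```

Because the auxiliary term only shapes the learned representations, sampling at inference time is unchanged, which is consistent with the abstract's note that the FID gains are obtained under the same sampling strategy.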