

Understand Before You Generate: Self-Guided Training for Autoregressive Image Generation

September 18, 2025
作者: Xiaoyu Yue, Zidong Wang, Yuqing Wang, Wenlong Zhang, Xihui Liu, Wanli Ouyang, Lei Bai, Luping Zhou
cs.AI

Abstract

Recent studies have demonstrated the importance of high-quality visual representations in image generation and have highlighted the limitations of generative models in image understanding. As a generative paradigm originally designed for natural language, autoregressive models face similar challenges. In this work, we present the first systematic investigation into the mechanisms of applying the next-token prediction paradigm to the visual domain. We identify three key properties that hinder the learning of high-level visual semantics: local and conditional dependence, inter-step semantic inconsistency, and spatial invariance deficiency. We show that these issues can be effectively addressed by introducing self-supervised objectives during training, leading to a novel training framework, Self-guided Training for AutoRegressive models (ST-AR). Without relying on pre-trained representation models, ST-AR significantly enhances the image understanding ability of autoregressive models and leads to improved generation quality. Specifically, ST-AR brings approximately 42% FID improvement for LlamaGen-L and 49% FID improvement for LlamaGen-XL, while maintaining the same sampling strategy.
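The abstract describes ST-AR as standard next-token prediction augmented with self-supervised objectives, but does not specify their exact form. The sketch below is a minimal, hypothetical illustration of that general recipe: a cross-entropy loss over image tokens combined with an auxiliary feature-consistency term between two views of the same image. All function names, the consistency formulation, and the weighting `lam` are illustrative assumptions, not the paper's actual objectives.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def next_token_loss(logits, targets):
    """Standard autoregressive cross-entropy over the image-token sequence.

    logits: (seq_len, vocab_size) predictions for each position
    targets: (seq_len,) ground-truth token ids
    """
    probs = softmax(logits)
    n = targets.shape[0]
    return -np.mean(np.log(probs[np.arange(n), targets] + 1e-9))

def self_guided_loss(feat_a, feat_b):
    """Hypothetical self-supervised term: align intermediate features of
    two augmented views of the same image (1 - mean cosine similarity).
    """
    a = feat_a / np.linalg.norm(feat_a, axis=-1, keepdims=True)
    b = feat_b / np.linalg.norm(feat_b, axis=-1, keepdims=True)
    return 1.0 - np.mean(np.sum(a * b, axis=-1))

def st_ar_objective(logits, targets, feat_a, feat_b, lam=0.5):
    """Combined training objective: generation loss + weighted
    self-supervised consistency loss (lam is an assumed hyperparameter)."""
    return next_token_loss(logits, targets) + lam * self_guided_loss(feat_a, feat_b)
```

The key design point the abstract implies is that the auxiliary term is computed during the same training pass, so no pre-trained representation model (e.g., a frozen vision encoder) is needed; the model supervises its own features.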