Understand Before You Generate: Self-Guided Training for Autoregressive Image Generation

Abstract

Recent studies have demonstrated the importance of high-quality visual representations in image generation and have highlighted the limitations of generative models in image understanding. Autoregressive models, a generative paradigm originally designed for natural language, face similar challenges. In this work, we present the first systematic investigation into the mechanisms of applying the next-token prediction paradigm to the visual domain. We identify three key properties that hinder the learning of high-level visual semantics: local and conditional dependence, inter-step semantic inconsistency, and spatial invariance deficiency. We show that these issues can be effectively addressed by introducing self-supervised objectives during training, leading to a novel training framework, Self-guided Training for AutoRegressive models (ST-AR). Without relying on pre-trained representation models, ST-AR significantly enhances the image understanding ability of autoregressive models and leads to improved generation quality. Specifically, ST-AR brings approximately 42% FID improvement for LlamaGen-L and 49% FID improvement for LlamaGen-XL, while maintaining the same sampling strategy.
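As a rough illustration of the next-token prediction objective the abstract refers to, the sketch below computes the mean cross-entropy of predicting each image token from its prefix, then adds a generic auxiliary term. It is not the ST-AR implementation: the function names, the toy token sequence, and the `aux_weight` coefficient are hypothetical, and the abstract does not specify the form or weighting of the self-supervised objectives.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of scores."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def next_token_loss(logits_per_step, tokens):
    """Mean cross-entropy of predicting tokens[t + 1] from the prefix tokens[:t + 1].

    logits_per_step[t] holds the model's scores over the codebook
    for the token at position t + 1.
    """
    targets = tokens[1:]
    assert len(logits_per_step) == len(targets)
    nll = 0.0
    for logits, target in zip(logits_per_step, targets):
        nll -= math.log(softmax(logits)[target])
    return nll / len(targets)


def combined_loss(ntp_loss, aux_loss, aux_weight=0.5):
    """Next-token loss plus a weighted auxiliary term.

    aux_loss stands in for any self-supervised objective; both its form
    and aux_weight are assumptions made for this sketch.
    """
    return ntp_loss + aux_weight * aux_loss


# Toy example: a hypothetical 3-token image with a codebook of size 3.
tokens = [2, 0, 1]
uniform_logits = [[0.0, 0.0, 0.0], [0.0, 0.0, 0.0]]
loss = next_token_loss(uniform_logits, tokens)
print(loss)  # equals log(3) for a uniform prediction over 3 codes
```

With uniform logits the loss reduces to the log of the codebook size, which is a convenient sanity check before plugging in real model outputs.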
