

Rethinking Training Dynamics in Scale-wise Autoregressive Generation

December 6, 2025
作者: Gengze Zhou, Chongjian Ge, Hao Tan, Feng Liu, Yicong Hong
cs.AI

Abstract

Recent advances in autoregressive (AR) generative models have produced increasingly powerful systems for media synthesis. Among them, next-scale prediction has emerged as a popular paradigm, where models generate images in a coarse-to-fine manner. However, scale-wise AR models suffer from exposure bias, which undermines generation quality. We identify two primary causes of this issue: (1) train-test mismatch, where the model must rely on its own imperfect predictions during inference, and (2) imbalance in scale-wise learning difficulty, where certain scales exhibit disproportionately higher optimization complexity. Through a comprehensive analysis of training dynamics, we propose Self-Autoregressive Refinement (SAR) to address these limitations. SAR introduces a Stagger-Scale Rollout (SSR) mechanism that performs lightweight autoregressive rollouts to expose the model to its own intermediate predictions, thereby aligning train-test patterns, and a complementary Contrastive Student-Forcing Loss (CSFL) that provides adequate supervision for self-generated contexts to ensure stable training. Experimental results show that applying SAR to pretrained AR models consistently improves generation quality with minimal computational overhead. For instance, SAR yields a 5.2% FID reduction on FlexVAR-d16 trained on ImageNet 256 within 10 epochs (5 hours on 32xA100 GPUs). Given its efficiency, scalability, and effectiveness, we expect SAR to serve as a reliable post-training method for visual autoregressive generation.
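The core idea behind the rollout mechanism described above — replacing ground-truth conditioning with the model's own predictions for some scales so that training matches inference — can be illustrated with a toy sketch. This is not the paper's implementation; the scale list, the `model_predict` stand-in, and the `rollout_from` parameter are all hypothetical placeholders chosen only to show the difference between teacher-forced and student-forced (rollout) contexts:

```python
# Toy illustration (NOT the paper's code) of the train-test mismatch in
# scale-wise AR generation and how a student-forcing rollout addresses it.

SCALES = [1, 2, 4, 8]  # assumed coarse-to-fine token-map resolutions

def model_predict(context, scale):
    """Stand-in for the AR model: returns a dummy 'prediction' token map."""
    return [f"pred@{scale}"] * (scale * scale)

def teacher_forcing_contexts(ground_truth):
    """Standard training: every scale conditions only on ground-truth prefixes."""
    return [list(ground_truth[:i]) for i in range(len(SCALES))]

def student_forcing_contexts(ground_truth, rollout_from=1):
    """Rollout sketch: from `rollout_from` onward, the context contains the
    model's own predictions for earlier scales, matching inference-time
    conditioning instead of the ground-truth-only training pattern."""
    contexts, prefix = [], []
    for i, scale in enumerate(SCALES):
        contexts.append(list(prefix))
        if i < rollout_from:
            prefix.append(ground_truth[i])               # early scales: GT
        else:
            prefix.append(model_predict(prefix, scale))  # self-generated
    return contexts

gt = [[f"gt@{s}"] * (s * s) for s in SCALES]
tf = teacher_forcing_contexts(gt)
sf = student_forcing_contexts(gt)
print(tf[2][1][0])  # teacher forcing: scale-2 context token is 'gt@2'
print(sf[2][1][0])  # rollout: scale-2 context token is 'pred@2'
```

In the teacher-forced setting the model never sees its own (imperfect) coarse-scale outputs during training, which is exactly the exposure-bias gap the abstract describes; the rollout variant closes it by feeding self-generated scales back in as context.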