

VAR RL Done Right: Tackling Asynchronous Policy Conflicts in Visual Autoregressive Generation

January 5, 2026
作者: Shikun Sun, Liao Qu, Huichao Zhang, Yiheng Liu, Yangyang Song, Xian Li, Xu Wang, Yi Jiang, Daniel K. Du, Xinglong Wu, Jia Jia
cs.AI

Abstract

Visual generation is dominated by three paradigms: AutoRegressive (AR), diffusion, and Visual AutoRegressive (VAR) models. Unlike AR and diffusion, VARs operate on heterogeneous input structures across their generation steps, which creates severe asynchronous policy conflicts. This issue becomes particularly acute in reinforcement learning (RL) scenarios, leading to unstable training and suboptimal alignment. To resolve this, we propose a novel framework to enhance Group Relative Policy Optimization (GRPO) by explicitly managing these conflicts. Our method integrates three synergistic components: 1) a stabilizing intermediate reward to guide early-stage generation; 2) a dynamic time-step reweighting scheme for precise credit assignment; and 3) a novel mask propagation algorithm, derived from principles of Reward Feedback Learning (ReFL), designed to isolate optimization effects both spatially and temporally. Our approach demonstrates significant improvements in sample quality and objective alignment over the vanilla GRPO baseline, enabling robust and effective optimization for VAR models.
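The abstract names the three components but does not give their exact formulation. Below is a minimal, illustrative sketch of how they could plug into a GRPO-style objective, assuming a PPO-like clipped surrogate with group-normalized advantages; the tensors `inter_reward`, `step_weights`, and `prop_mask` are hypothetical stand-ins for the intermediate reward, the dynamic time-step reweighting, and the propagated mask described above, and the paper's actual objective may differ.

```python
import torch

def grpo_var_loss(
    logp_new,        # [G, T, N] log-probs of sampled tokens under the current policy
    logp_old,        # [G, T, N] log-probs under the sampling (old) policy
    final_reward,    # [G] terminal reward per sample in the group
    inter_reward,    # [G] hypothetical intermediate reward for early VAR scales
    step_weights,    # [T] hypothetical dynamic per-step weights for credit assignment
    prop_mask,       # [G, T, N] hypothetical 0/1 mask propagated over space and steps
    clip_eps: float = 0.2,
    inter_coef: float = 0.5,
):
    """Illustrative GRPO-style surrogate combining the three components
    described in the abstract; not the paper's actual implementation."""
    # Group-relative advantages: normalize the combined reward within the group.
    reward = final_reward + inter_coef * inter_reward               # [G]
    adv = (reward - reward.mean()) / (reward.std() + 1e-8)          # [G]

    # PPO-style clipped importance ratios per token.
    ratio = (logp_new - logp_old).exp()                             # [G, T, N]
    adv_b = adv[:, None, None]                                      # broadcast to tokens
    unclipped = ratio * adv_b
    clipped = torch.clamp(ratio, 1 - clip_eps, 1 + clip_eps) * adv_b
    surrogate = torch.minimum(unclipped, clipped)                   # [G, T, N]

    # Dynamic time-step reweighting and mask propagation: only weighted,
    # unmasked positions contribute to the policy update.
    weights = step_weights[None, :, None] * prop_mask               # [G, T, N]
    denom = weights.sum().clamp_min(1.0)
    return -(surrogate * weights).sum() / denom
```

Normalizing by the sum of the weighted mask (rather than the raw token count) keeps the gradient scale stable when the propagated mask is sparse at some scales, which is one plausible way to reconcile the per-step reweighting with masked optimization.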