

VAR RL Done Right: Tackling Asynchronous Policy Conflicts in Visual Autoregressive Generation

January 5, 2026
作者: Shikun Sun, Liao Qu, Huichao Zhang, Yiheng Liu, Yangyang Song, Xian Li, Xu Wang, Yi Jiang, Daniel K. Du, Xinglong Wu, Jia Jia
cs.AI

Abstract

Visual generation is dominated by three paradigms: AutoRegressive (AR), diffusion, and Visual AutoRegressive (VAR) models. Unlike AR and diffusion, VARs operate on heterogeneous input structures across their generation steps, which creates severe asynchronous policy conflicts. This issue becomes particularly acute in reinforcement learning (RL) scenarios, leading to unstable training and suboptimal alignment. To resolve this, we propose a novel framework to enhance Group Relative Policy Optimization (GRPO) by explicitly managing these conflicts. Our method integrates three synergistic components: 1) a stabilizing intermediate reward to guide early-stage generation; 2) a dynamic time-step reweighting scheme for precise credit assignment; and 3) a novel mask propagation algorithm, derived from principles of Reward Feedback Learning (ReFL), designed to isolate optimization effects both spatially and temporally. Our approach demonstrates significant improvements in sample quality and objective alignment over the vanilla GRPO baseline, enabling robust and effective optimization for VAR models.
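To make the three components named above concrete, here is a minimal, hedged sketch of how a GRPO-style objective for a VAR policy could combine group-relative advantages, a per-scale (time-step) reweighting, and a token mask. Everything in it is an illustrative assumption: the function name grpo_var_loss, the (G, T, N) tensor layout, and the timestep_weights and token_mask arguments are hypothetical, and the abstract does not specify the paper's actual formulation of the intermediate reward or the ReFL-derived mask propagation.

```python
# Illustrative sketch only (not the authors' code): a GRPO-style clipped loss
# for a VAR policy with per-scale reweighting and token masking.
# Hypothetical names/shapes: G samples per group, T VAR scales, N tokens/scale.
import torch

def grpo_var_loss(log_probs, old_log_probs, rewards, timestep_weights,
                  token_mask, clip_eps=0.2):
    """
    log_probs, old_log_probs: (G, T, N) per-token log-probs for a group of G samples.
    rewards: (G,) scalar reward per sample (could fold in an intermediate reward).
    timestep_weights: (T,) credit-assignment weights across VAR scales.
    token_mask: (G, T, N) with 1 where a token should receive gradient, else 0.
    """
    # Group-relative advantage: normalize rewards within the group (GRPO-style).
    adv = (rewards - rewards.mean()) / (rewards.std() + 1e-8)       # (G,)
    adv = adv[:, None, None]                                        # broadcast to tokens

    # PPO-style clipped importance ratio on per-token log-probs.
    ratio = torch.exp(log_probs - old_log_probs)
    unclipped = ratio * adv
    clipped = torch.clamp(ratio, 1 - clip_eps, 1 + clip_eps) * adv
    per_token = -torch.minimum(unclipped, clipped)                  # (G, T, N)

    # Dynamic time-step reweighting plus spatial/temporal masking.
    per_token = per_token * timestep_weights[None, :, None] * token_mask

    # Average over unmasked tokens only.
    return per_token.sum() / token_mask.sum().clamp(min=1)
```

In this reading, the asynchronous policy conflict appears as gradients from heterogeneous scales competing within a single update; the per-scale weights and the mask are the hooks through which the abstract's reweighting and mask-propagation ideas would act, though their actual construction is left unspecified here.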