GARDO: Reinforcing Diffusion Models without Reward Hacking
December 30, 2025
Authors: Haoran He, Yuxiao Ye, Jie Liu, Jiajun Liang, Zhiyong Wang, Ziyang Yuan, Xintao Wang, Hangyu Mao, Pengfei Wan, Ling Pan
cs.AI
Abstract
Fine-tuning diffusion models via online reinforcement learning (RL) has shown great potential for enhancing text-to-image alignment. However, since precisely specifying a ground-truth objective for visual tasks remains challenging, the models are often optimized using a proxy reward that only partially captures the true goal. This mismatch often leads to reward hacking, where proxy scores increase while real image quality deteriorates and generation diversity collapses. While common solutions add regularization against a reference policy to prevent reward hacking, they compromise sample efficiency and impede the exploration of novel, high-reward regions, as the reference policy is usually sub-optimal. To address the competing demands of sample efficiency, effective exploration, and mitigation of reward hacking, we propose Gated and Adaptive Regularization with Diversity-aware Optimization (GARDO), a versatile framework compatible with various RL algorithms. Our key insight is that regularization need not be applied universally; instead, it is highly effective to selectively penalize a subset of samples that exhibit high uncertainty. To address the exploration challenge, GARDO introduces an adaptive regularization mechanism wherein the reference model is periodically updated to match the capabilities of the online policy, ensuring a relevant regularization target. To address the mode collapse issue in RL, GARDO amplifies the rewards for high-quality samples that also exhibit high diversity, encouraging mode coverage without destabilizing the optimization process. Extensive experiments across diverse proxy rewards and held-out, unseen metrics consistently show that GARDO mitigates reward hacking and enhances generation diversity without sacrificing sample efficiency or exploration, highlighting its effectiveness and robustness.
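To make the three mechanisms described in the abstract concrete, the following is a minimal, illustrative PyTorch-style sketch of a GARDO-like update: diversity-aware reward amplification, gated (uncertainty-based) regularization toward a reference policy, and periodic refreshing of that reference. Every name here (`log_prob`, `uncertainty`, `diversity`, `unc_threshold`, `div_bonus`, `update_every`, and all default values) is an assumption made for illustration; the paper's exact formulation, estimators, and hyperparameters are not given in this abstract.

```python
import torch


def gardo_loss(policy, ref_policy, samples, rewards, uncertainty, diversity,
               unc_threshold=0.7, div_bonus=0.5, kl_coef=0.1):
    """One GARDO-like policy-gradient loss (illustrative sketch only).

    `uncertainty` and `diversity` are assumed per-sample scores in [0, 1];
    how they are estimated is not specified by the abstract.
    """
    # Diversity-aware optimization: amplify rewards of samples that are both
    # above-average in quality and diverse, to encourage mode coverage.
    high_quality = (rewards > rewards.mean()).float()
    shaped_rewards = rewards * (1.0 + div_bonus * diversity * high_quality)

    # Gated regularization: only high-uncertainty samples are pulled toward
    # the reference policy; confident samples are left unregularized.
    gate = (uncertainty > unc_threshold).float()
    logp = policy.log_prob(samples)          # assumed policy interface
    with torch.no_grad():
        ref_logp = ref_policy.log_prob(samples)
    # Per-sample log-probability ratio, used as a simple surrogate KL penalty.
    gated_penalty = gate * (logp - ref_logp)

    # REINFORCE-style objective plus the gated regularization term.
    return -(shaped_rewards.detach() * logp).mean() + kl_coef * gated_penalty.mean()


def maybe_refresh_reference(policy, ref_policy, step, update_every=1000):
    """Adaptive regularization: periodically sync the reference model to the
    online policy so the regularization target stays relevant."""
    if step % update_every == 0:
        ref_policy.load_state_dict(policy.state_dict())
```

The intent of the gating, as the abstract describes it, is that only uncertain samples pay a regularization cost toward the reference, so confident high-reward samples remain free to move into novel regions; periodically refreshing the reference keeps that cost anchored to a policy of comparable capability rather than a stale, sub-optimal one.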