GARDO: Reinforcing Diffusion Models without Reward Hacking
December 30, 2025
Authors: Haoran He, Yuxiao Ye, Jie Liu, Jiajun Liang, Zhiyong Wang, Ziyang Yuan, Xintao Wang, Hangyu Mao, Pengfei Wan, Ling Pan
cs.AI
Abstract
Fine-tuning diffusion models via online reinforcement learning (RL) has shown great potential for enhancing text-to-image alignment. However, since precisely specifying a ground-truth objective for visual tasks remains challenging, the models are often optimized using a proxy reward that only partially captures the true goal. This mismatch often leads to reward hacking, where proxy scores increase while real image quality deteriorates and generation diversity collapses. While common solutions add regularization against the reference policy to prevent reward hacking, they compromise sample efficiency and impede the exploration of novel, high-reward regions, as the reference policy is usually sub-optimal. To address the competing demands of sample efficiency, effective exploration, and mitigation of reward hacking, we propose Gated and Adaptive Regularization with Diversity-aware Optimization (GARDO), a versatile framework compatible with various RL algorithms. Our key insight is that regularization need not be applied universally; instead, it is highly effective to selectively penalize a subset of samples that exhibit high uncertainty. To address the exploration challenge, GARDO introduces an adaptive regularization mechanism wherein the reference model is periodically updated to match the capabilities of the online policy, ensuring a relevant regularization target. To address the mode collapse issue in RL, GARDO amplifies the rewards for high-quality samples that also exhibit high diversity, encouraging mode coverage without destabilizing the optimization process. Extensive experiments across diverse proxy rewards and hold-out unseen metrics consistently show that GARDO mitigates reward hacking and enhances generation diversity without sacrificing sample efficiency or exploration, highlighting its effectiveness and robustness.
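The abstract describes three ingredients (gated regularization of high-uncertainty samples, an adaptively refreshed reference model, and diversity-aware reward amplification) without giving implementation details. The following is a minimal NumPy sketch of how those ingredients could be combined in a policy-gradient-style fine-tuning loop; the function names, quantile thresholds, coefficients, and the group-normalized advantage at the end are illustrative assumptions, not the paper's actual implementation.

```python
import numpy as np

def gardo_advantages(rewards, kl_to_ref, diversity, uncertainty,
                     kl_coef=0.1, uncertainty_quantile=0.8,
                     diversity_bonus=0.5, selection_quantile=0.5):
    """Shape per-sample advantages with gated KL regularization and a
    diversity-aware reward bonus. All thresholds are illustrative."""
    rewards = np.asarray(rewards, dtype=float)
    kl_to_ref = np.asarray(kl_to_ref, dtype=float)
    diversity = np.asarray(diversity, dtype=float)
    uncertainty = np.asarray(uncertainty, dtype=float)

    # Gated regularization: only the subset of samples above the
    # uncertainty threshold is penalized toward the reference policy.
    gate = uncertainty >= np.quantile(uncertainty, uncertainty_quantile)
    shaped = rewards - kl_coef * kl_to_ref * gate

    # Diversity-aware amplification: boost rewards for samples that are
    # both high quality and high diversity, encouraging mode coverage.
    high_quality = rewards >= np.quantile(rewards, selection_quantile)
    high_diverse = diversity >= np.quantile(diversity, selection_quantile)
    shaped = shaped + diversity_bonus * diversity * (high_quality & high_diverse)

    # Group-normalized advantages, one common way to feed shaped rewards
    # into an RL update for diffusion fine-tuning.
    return (shaped - shaped.mean()) / (shaped.std() + 1e-8)


def maybe_update_reference(step, policy_params, ref_params, period=500):
    """Adaptive regularization: periodically sync the reference model to
    the online policy so the regularization target stays relevant."""
    if step % period == 0:
        return {k: v.copy() for k, v in policy_params.items()}
    return ref_params
```

In this reading, the gate confines the penalty to uncertain samples so that well-supported, high-reward explorations are not dragged back toward a sub-optimal reference, while the periodic reference refresh keeps the penalty meaningful as the policy improves.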