Trust Your Critic: Robust Reward Modeling and Reinforcement Learning for Faithful Image Editing and Generation
March 12, 2026
Authors: Xiangyu Zhao, Peiyuan Zhang, Junming Lin, Tianhao Liang, Yuchen Duan, Shengyuan Ding, Changyao Tian, Yuhang Zang, Junchi Yan, Xue Yang
cs.AI
Abstract
Reinforcement learning (RL) has emerged as a promising paradigm for enhancing image editing and text-to-image (T2I) generation. However, current reward models, which act as critics during RL, often suffer from hallucinations and assign noisy scores, misguiding the optimization process. In this paper, we present FIRM (Faithful Image Reward Modeling), a comprehensive framework that develops robust reward models to provide accurate and reliable guidance for faithful image generation and editing. First, we design tailored data curation pipelines to construct high-quality scoring datasets. Specifically, we evaluate editing using both execution and consistency, while generation is primarily assessed via instruction following. Using these pipelines, we collect the FIRM-Edit-370K and FIRM-Gen-293K datasets, and train specialized reward models (FIRM-Edit-8B and FIRM-Gen-8B) that accurately reflect these criteria. Second, we introduce FIRM-Bench, a comprehensive benchmark specifically designed for editing and generation critics. Evaluations demonstrate that our models achieve superior alignment with human judgment compared to existing metrics. Furthermore, to seamlessly integrate these critics into the RL pipeline, we formulate a novel "Base-and-Bonus" reward strategy that balances competing objectives: Consistency-Modulated Execution (CME) for editing and Quality-Modulated Alignment (QMA) for generation. Empowered by this framework, our resulting models FIRM-Qwen-Edit and FIRM-SD3.5 achieve substantial performance breakthroughs. Comprehensive experiments demonstrate that FIRM mitigates hallucinations, setting a new standard for fidelity and instruction adherence compared to existing general models. All of our datasets, models, and code are publicly available at https://firm-reward.github.io.
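The abstract does not spell out the "Base-and-Bonus" formulas, but the idea of a guaranteed base term plus a bonus gated by a competing objective can be sketched as follows. This is an illustrative reconstruction only: the function names, the multiplicative bonus form, and the threshold `tau` are assumptions, not the paper's actual CME/QMA definitions.

```python
def base_and_bonus(base: float, gate: float, tau: float = 0.5) -> float:
    """Hypothetical 'Base-and-Bonus' reward shape.

    The base objective is always rewarded; an extra bonus is granted
    only when a competing gating objective clears the threshold tau.
    Scores are assumed to lie in [0, 1]. The paper's exact
    formulation may differ.
    """
    return base + base * max(0.0, gate - tau)


def cme(execution: float, consistency: float) -> float:
    """Illustrative Consistency-Modulated Execution (editing):
    execution is the base term, consistency gates the bonus, so an
    edit cannot score highly by overwriting unrelated content."""
    return base_and_bonus(execution, consistency)


def qma(alignment: float, quality: float) -> float:
    """Illustrative Quality-Modulated Alignment (generation):
    instruction alignment is the base term, image quality gates the
    bonus."""
    return base_and_bonus(alignment, quality)
```

Under this sketch, a perfectly executed but inconsistent edit earns only the base reward, while a consistent one earns the bonus on top, which is one way a critic can balance the two competing objectives during RL.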