Trust Your Critic: Robust Reward Modeling and Reinforcement Learning for Faithful Image Editing and Generation

March 12, 2026
作者: Xiangyu Zhao, Peiyuan Zhang, Junming Lin, Tianhao Liang, Yuchen Duan, Shengyuan Ding, Changyao Tian, Yuhang Zang, Junchi Yan, Xue Yang
cs.AI

Abstract

Reinforcement learning (RL) has emerged as a promising paradigm for enhancing image editing and text-to-image (T2I) generation. However, current reward models, which act as critics during RL, often suffer from hallucinations and assign noisy scores, thereby misleading the optimization process. In this paper, we present FIRM (Faithful Image Reward Modeling), a comprehensive framework that develops robust reward models to provide accurate and reliable guidance for faithful image generation and editing. First, we design tailored data curation pipelines to construct high-quality scoring datasets: editing is evaluated on both execution and consistency, while generation is primarily assessed via instruction following. Using these pipelines, we collect the FIRM-Edit-370K and FIRM-Gen-293K datasets and train specialized reward models (FIRM-Edit-8B and FIRM-Gen-8B) that accurately reflect these criteria. Second, we introduce FIRM-Bench, a comprehensive benchmark designed specifically for editing and generation critics. Evaluations demonstrate that our models align with human judgment more closely than existing metrics do. Furthermore, to integrate these critics seamlessly into the RL pipeline, we formulate a novel "Base-and-Bonus" reward strategy that balances competing objectives: Consistency-Modulated Execution (CME) for editing and Quality-Modulated Alignment (QMA) for generation. Empowered by this framework, our resulting models, FIRM-Qwen-Edit and FIRM-SD3.5, achieve substantial performance gains. Comprehensive experiments demonstrate that FIRM mitigates hallucinations and sets a new standard for fidelity and instruction adherence over existing general-purpose models. All of our datasets, models, and code are publicly available at https://firm-reward.github.io.
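The abstract does not spell out the functional form of the "Base-and-Bonus" strategy, so the following Python sketch is only one plausible reading: each reward pairs a base term for the primary objective with a bonus term that is multiplicatively modulated by the competing signal. The function names, the `alpha` weight, and the gating scheme here are illustrative assumptions, not the paper's definition.

```python
# A minimal sketch of a "Base-and-Bonus" reward, assuming all critic
# scores are normalized to [0, 1]. The split into a base term and a
# gated bonus term, and the value of `alpha`, are assumptions.

def cme_reward(execution: float, consistency: float, alpha: float = 0.5) -> float:
    """Consistency-Modulated Execution (editing, hypothetical form):
    a base term rewards instruction execution, while the bonus only
    pays out in proportion to how well consistency with the source
    image is preserved."""
    base = alpha * execution
    bonus = (1.0 - alpha) * execution * consistency  # execution gated by consistency
    return base + bonus

def qma_reward(alignment: float, quality: float, alpha: float = 0.5) -> float:
    """Quality-Modulated Alignment (generation, hypothetical form):
    a base term rewards instruction following, while the bonus is
    scaled by overall image quality to discourage degenerate but
    'aligned' outputs."""
    base = alpha * alignment
    bonus = (1.0 - alpha) * alignment * quality  # alignment gated by quality
    return base + bonus

# Example: a perfectly executed edit that destroys consistency earns
# far less reward than one that preserves the source image.
print(cme_reward(execution=1.0, consistency=0.1))  # 0.55
print(cme_reward(execution=1.0, consistency=0.9))  # 0.95
```

Under this reading, the multiplicative gate is what prevents the policy from maximizing one critic at the expense of the other, which matches the abstract's stated goal of balancing competing objectives.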