PaCo-RL: Advancing Reinforcement Learning for Consistent Image Generation with Pairwise Reward Modeling
December 2, 2025
Authors: Bowen Ping, Chengyou Jia, Minnan Luo, Changliang Xia, Xin Shen, Zhuohang Dang, Hangwei Qian
cs.AI
Abstract
Consistent image generation requires faithfully preserving identities, styles, and logical coherence across multiple images, which is essential for applications such as storytelling and character design. Supervised training approaches struggle with this task due to the lack of large-scale datasets capturing visual consistency and the complexity of modeling human perceptual preferences. In this paper, we argue that reinforcement learning (RL) offers a promising alternative by enabling models to learn complex and subjective visual criteria in a data-free manner. To achieve this, we introduce PaCo-RL, a comprehensive framework that combines a specialized consistency reward model with an efficient RL algorithm. The first component, PaCo-Reward, is a pairwise consistency evaluator trained on a large-scale dataset constructed via automated sub-figure pairing. It evaluates consistency through a generative, autoregressive scoring mechanism enhanced by task-aware instructions and chain-of-thought (CoT) reasoning. The second component, PaCo-GRPO, leverages a novel resolution-decoupled optimization strategy to substantially reduce RL training cost, alongside a log-tamed multi-reward aggregation mechanism that ensures balanced and stable reward optimization. Extensive experiments across two representative subtasks show that PaCo-Reward significantly improves alignment with human perception of visual consistency, and PaCo-GRPO achieves state-of-the-art consistency performance with improved training efficiency and stability. Together, these results highlight the promise of PaCo-RL as a practical and scalable solution for consistent image generation. The project page is available at https://x-gengroup.github.io/HomePage_PaCo-RL/.
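To make the two reward mechanisms concrete, the minimal Python sketch below shows one plausible reading of them; it is an illustration under stated assumptions, not the paper's implementation. It assumes the generative autoregressive scorer can be reduced to the probability of a "yes" judgment token, and that "log-tamed" aggregation means summing log(1 + r) over the individual rewards; all function names and example values are hypothetical.

```python
import numpy as np

def pairwise_consistency_score(yes_logit: float, no_logit: float) -> float:
    """Turn a generative evaluator's judgment-token logits into a reward.

    Assumption: the autoregressive scorer emits a yes/no judgment about
    whether an image pair is consistent, and we take the softmax
    probability of "yes" as a consistency score in [0, 1].
    """
    logits = np.array([yes_logit, no_logit])
    probs = np.exp(logits - logits.max())  # stable softmax
    probs /= probs.sum()
    return float(probs[0])  # probability mass on the "yes" token

def log_tamed_aggregate(rewards: list[float]) -> float:
    """Aggregate several reward signals with a log transform.

    Assumption: "log-tamed" means summing log(1 + r_i). The concave log
    compresses large individual rewards, so no single signal (e.g.,
    consistency vs. text alignment) can dominate the policy update.
    """
    return float(sum(np.log1p(max(r, 0.0)) for r in rewards))

# Hypothetical usage: combine a pairwise consistency reward with a
# text-alignment reward before the RL update.
consistency = pairwise_consistency_score(yes_logit=2.1, no_logit=-0.3)
alignment = 0.7  # e.g., from a CLIP-style text-image scorer
total_reward = log_tamed_aggregate([consistency, alignment])
print(f"consistency={consistency:.3f}, total={total_reward:.3f}")
```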