PaCo-RL: Advancing Reinforcement Learning for Consistent Image Generation with Pairwise Reward Modeling

December 2, 2025
Authors: Bowen Ping, Chengyou Jia, Minnan Luo, Changliang Xia, Xin Shen, Zhuohang Dang, Hangwei Qian
cs.AI

Abstract

Consistent image generation requires faithfully preserving identities, styles, and logical coherence across multiple images, which is essential for applications such as storytelling and character design. Supervised training approaches struggle with this task due to the lack of large-scale datasets capturing visual consistency and the complexity of modeling human perceptual preferences. In this paper, we argue that reinforcement learning (RL) offers a promising alternative by enabling models to learn complex and subjective visual criteria in a data-free manner. To achieve this, we introduce PaCo-RL, a comprehensive framework that combines a specialized consistency reward model with an efficient RL algorithm. The first component, PaCo-Reward, is a pairwise consistency evaluator trained on a large-scale dataset constructed via automated sub-figure pairing. It evaluates consistency through a generative, autoregressive scoring mechanism enhanced by task-aware instructions and chain-of-thought (CoT) reasoning. The second component, PaCo-GRPO, leverages a novel resolution-decoupled optimization strategy to substantially reduce RL cost, alongside a log-tamed multi-reward aggregation mechanism that ensures balanced and stable reward optimization. Extensive experiments across two representative subtasks show that PaCo-Reward significantly improves alignment with human perception of visual consistency, and PaCo-GRPO achieves state-of-the-art consistency performance with improved training efficiency and stability. Together, these results highlight the promise of PaCo-RL as a practical and scalable solution for consistent image generation. The project page is available at https://x-gengroup.github.io/HomePage_PaCo-RL/.
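
The abstract names, but does not specify, how PaCo-Reward's generative, autoregressive scoring works. As a rough sketch of what a pairwise scoring interface in that spirit might look like (the prompt wording, the `vlm_generate` callable, and the score parsing are all hypothetical assumptions, not the authors' implementation), a vision-language model could be prompted with an image pair plus a task-aware instruction and asked to reason before emitting a score:

```python
import re
from typing import Callable

def pairwise_consistency_score(
    img_a: bytes,
    img_b: bytes,
    task_instruction: str,
    vlm_generate: Callable[[list, str], str],
) -> float:
    """Hypothetical sketch: score an image pair's consistency with a
    generative VLM. `vlm_generate` stands in for any instruction-tuned
    vision-language model that returns free-form text for (images, prompt)."""
    prompt = (
        f"Task: {task_instruction}\n"
        "Reason step by step about identity, style, and logical coherence "
        "between the two images, then end with 'Score: <0-10>'."
    )
    output = vlm_generate([img_a, img_b], prompt)
    match = re.search(r"Score:\s*([0-9]+(?:\.[0-9]+)?)", output)
    if match is None:
        return 0.0  # fall back to the lowest reward if parsing fails
    return min(float(match.group(1)), 10.0) / 10.0  # normalize to [0, 1]
```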
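Likewise, the "log-tamed" multi-reward aggregation in PaCo-GRPO is only named here. One plausible reading, shown purely as an assumption, is to squash each reward term logarithmically before combining, so that no single term on a larger scale dominates the optimization signal:

```python
import math

def log_tamed_aggregate(rewards: dict[str, float],
                        weights: dict[str, float] | None = None) -> float:
    """Hypothetical sketch: each reward is squashed with log1p so a large
    outlier in one term cannot swamp the others; weights are optional
    per-term scales."""
    weights = weights or {}
    return sum(weights.get(name, 1.0) * math.log1p(max(reward, 0.0))
               for name, reward in rewards.items())

# Example: two rewards on different scales (both names are illustrative).
total = log_tamed_aggregate({"consistency": 0.9, "text_alignment": 4.2})
```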