ThinkRL-Edit: Thinking in Reinforcement Learning for Reasoning-Centric Image Editing
January 6, 2026
Authors: Hengjia Li, Liming Jiang, Qing Yan, Yizhi Song, Hao Kang, Zichuan Liu, Xin Lu, Boxi Wu, Deng Cai
cs.AI
Abstract
Instruction-driven image editing with unified multimodal generative models has advanced rapidly, yet the models' underlying visual reasoning remains limited, leading to suboptimal performance on reasoning-centric edits. Reinforcement learning (RL) has been explored for improving image editing quality, but it faces three key challenges: (1) limited reasoning exploration confined to denoising stochasticity, (2) biased reward fusion, and (3) unstable VLM-based instruction rewards. In this work, we propose ThinkRL-Edit, a reasoning-centric RL framework that decouples visual reasoning from image synthesis and expands reasoning exploration beyond the denoising process. To this end, we introduce Chain-of-Thought (CoT)-based reasoning sampling into online sampling, adding planning and reflection stages prior to generation that compel the model to explore multiple semantic hypotheses and validate their plausibility before committing to a visual outcome. To avoid the failure modes of weighted reward aggregation, we propose an unbiased chain preference grouping strategy across multiple reward dimensions. Moreover, we replace interval-based VLM scores with a binary checklist, yielding more precise, lower-variance, and interpretable rewards for complex reasoning tasks. Experiments show that our method significantly outperforms prior work on reasoning-centric image editing, producing instruction-faithful, visually coherent, and semantically grounded edits.
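To illustrate the binary-checklist reward mentioned above, the following is a minimal, hypothetical sketch rather than the authors' implementation: it assumes a `vlm_answer(image, question) -> bool` helper (not described in the paper) that poses a yes/no question to a VLM, and it scores an edited image by the fraction of checklist items that pass, in place of a single interval-valued VLM score.

```python
# Hypothetical sketch of a binary-checklist reward, assuming a
# vlm_answer(image, question) -> bool helper supplied by the caller.
# Each edit instruction is decomposed into yes/no checks; the reward is
# the fraction of checks that pass, giving a low-variance, interpretable
# signal compared with an interval-based VLM score.

from typing import Callable, List


def checklist_reward(
    edited_image: object,
    checklist: List[str],
    vlm_answer: Callable[[object, str], bool],
) -> float:
    """Score an edited image against a list of binary checklist questions."""
    if not checklist:
        return 0.0
    passed = sum(1 for question in checklist if vlm_answer(edited_image, question))
    return passed / len(checklist)


# Example usage with a hypothetical checklist for "make the left cup red":
# reward = checklist_reward(
#     edited_image,
#     [
#         "Is the left cup red?",
#         "Is the right cup unchanged?",
#         "Is the background unchanged?",
#     ],
#     vlm_answer,
# )
```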