
ThinkRL-Edit: Thinking in Reinforcement Learning for Reasoning-Centric Image Editing

January 6, 2026
Authors: Hengjia Li, Liming Jiang, Qing Yan, Yizhi Song, Hao Kang, Zichuan Liu, Xin Lu, Boxi Wu, Deng Cai
cs.AI

Abstract

Instruction-driven image editing with unified multimodal generative models has advanced rapidly, yet the visual reasoning underlying these models remains limited, leading to suboptimal performance on reasoning-centric edits. Reinforcement learning (RL) has been investigated for improving image-editing quality, but it faces three key challenges: (1) reasoning exploration confined to denoising stochasticity, (2) biased reward fusion, and (3) unstable VLM-based instruction rewards. In this work, we propose ThinkRL-Edit, a reasoning-centric RL framework that decouples visual reasoning from image synthesis and expands reasoning exploration beyond the denoising process. To this end, we introduce Chain-of-Thought (CoT)-based reasoning sampling with planning and reflection stages prior to generation during online sampling, compelling the model to explore multiple semantic hypotheses and validate their plausibility before committing to a visual outcome. To avoid the failure modes of weighted reward aggregation, we propose an unbiased chain preference grouping strategy across multiple reward dimensions. Moreover, we replace interval-based VLM scores with a binary checklist, yielding more precise, lower-variance, and interpretable rewards for complex reasoning. Experiments show our method significantly outperforms prior work on reasoning-centric image editing, producing instruction-faithful, visually coherent, and semantically grounded edits.
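The binary-checklist reward admits a compact illustration. The sketch below is a minimal, hypothetical rendering of the idea as described in the abstract: the edit instruction is decomposed into yes/no checks, a VLM answers each check independently, and the reward is the fraction of checks that pass. All names here (`ChecklistItem`, `checklist_reward`, `vlm_yes_no`) are assumptions for illustration, not the paper's actual implementation.

```python
from dataclasses import dataclass
from typing import Callable, List

@dataclass
class ChecklistItem:
    question: str  # a yes/no question about the edited image

def checklist_reward(
    image,  # edited image (e.g., a PIL.Image); left untyped in this sketch
    items: List[ChecklistItem],
    vlm_yes_no: Callable[[object, str], bool],  # hypothetical VLM wrapper: (image, question) -> bool
) -> float:
    """Binary-checklist reward: the mean of per-item pass/fail bits.

    Each item contributes 0 or 1, so the reward is bounded in [0, 1],
    avoiding the high variance of free-form interval VLM scores.
    """
    if not items:
        return 0.0
    passed = sum(vlm_yes_no(image, item.question) for item in items)
    return passed / len(items)

# Example checklist for the instruction
# "replace the apple on the table with an orange":
items = [
    ChecklistItem("Is the apple no longer present?"),
    ChecklistItem("Is there an orange where the apple was?"),
    ChecklistItem("Is the rest of the scene unchanged?"),
]
```

Because each check is a constrained yes/no query rather than an open-ended score, disagreements between VLM calls collapse to single bits, which is one plausible reading of why the paper reports more precise and lower-variance rewards.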