Agentic Planning with Reasoning for Image Styling via Offline RL
March 7, 2026
作者: Subhojyoti Mukherjee, Stefano Petrangeli, Branislav Kveton, Trung Bui, Franck Dernoncourt, Arko Mukherjee
cs.AI
Abstract
Direct prompt-based editing often fails on complex transformations because vague, subjective prompts require a nuanced understanding of what in the image should be changed. Our core intuition is that leveraging compositional image editing tools, rather than direct prompting, benefits from structured agent-level planning with explicit reasoning, leading to better results. This structured planning framework enables efficient offline RL post-training on quality-scored trajectories to improve performance. We present a tool-based agentic RL post-training framework that addresses this through structured planning with chain-of-thought reasoning. Our key contributions include: (1) A tool-based agentic planning methodology that combines a compositional library of orthogonal primitive transformations, structured context representation, and explicit per-step reasoning to decompose complex styling tasks into interpretable tool sequences. (2) A synthetic data generation pipeline producing three large-scale datasets (~10K trajectories each) with reasoning chains, plans, and quality scores, supervision that no existing dataset provides. Our datasets and code are publicly available at the HuggingFace repository. (3) Offline RL training methods for learning planners with reasoning, our core algorithmic contribution, which consistently improve over the Edit-Only baseline in visual quality and instruction following. (4) A comprehensive evaluation across 4B- and 8B-parameter Qwen3-VL models showing that our methods outperform other baselines on the majority of compositional tasks, validated by human evaluations.
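To make the planning formulation concrete, the sketch below illustrates the abstract's idea of decomposing a styling instruction into an interpretable sequence of primitive tool calls, each paired with explicit per-step reasoning, and attaching a scalar quality score to the resulting trajectory for offline RL post-training. All names here (the toy transforms, `PlanStep`, `Trajectory`) are hypothetical stand-ins, not the paper's actual tool library or API.

```python
from dataclasses import dataclass
from typing import Callable, List

# Hypothetical orthogonal primitive transforms; real ones would operate on
# image tensors, here they just record the applied operation.
def adjust_contrast(img: str, amount: float = 1.2) -> str:
    return f"{img}|contrast({amount})"

def apply_palette(img: str, name: str = "noir") -> str:
    return f"{img}|palette({name})"

def add_grain(img: str, strength: float = 0.3) -> str:
    return f"{img}|grain({strength})"

@dataclass
class PlanStep:
    reasoning: str        # explicit chain-of-thought for this step
    tool: Callable
    kwargs: dict

@dataclass
class Trajectory:
    steps: List[PlanStep]
    quality: float        # scalar quality score used by offline RL

def execute(plan: List[PlanStep], img: str) -> str:
    """Apply the planned tool sequence to an image, step by step."""
    for step in plan:
        img = step.tool(img, **step.kwargs)
    return img

# A complex styling instruction ("make it film noir") decomposed into
# an interpretable, auditable tool sequence with per-step reasoning.
plan = [
    PlanStep("Noir style needs deeper shadows first.", adjust_contrast, {"amount": 1.4}),
    PlanStep("Then desaturate via a noir palette.", apply_palette, {"name": "noir"}),
    PlanStep("Finish with light film grain for texture.", add_grain, {"strength": 0.25}),
]
result = execute(plan, "photo.png")
print(result)  # photo.png|contrast(1.4)|palette(noir)|grain(0.25)
```

Because each trajectory carries a quality score, a dataset of such `Trajectory` records can be filtered or reward-weighted during post-training, which is the role the quality-scored trajectories play in the offline RL setup described above.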