Agentic Planning with Reasoning for Image Styling via Offline RL
March 7, 2026
作者: Subhojyoti Mukherjee, Stefano Petrangeli, Branislav Kveton, Trung Bui, Franck Dernoncourt, Arko Mukherjee
cs.AI
Abstract
Direct prompt-based editing often fails on complex transformations because vague, subjective prompts require a nuanced understanding of what should be changed in the image. Our core intuition is that leveraging compositional image editing tools rather than direct prompting benefits from structured agent-level planning with explicit reasoning, leading to better results. This structured planning framework enables efficient offline RL post-training on quality-scored trajectories to further improve performance. We present a tool-based agentic RL post-training framework that addresses this through structured planning with chain-of-thought reasoning. Our key contributions include: (1) a tool-based agentic planning methodology that combines a compositional library of orthogonal primitive transformations, structured context representation, and explicit per-step reasoning to decompose complex styling tasks into interpretable tool sequences; (2) a synthetic data generation pipeline producing three large-scale datasets (each with ~10K trajectories) containing reasoning chains, plans, and quality scores, as no existing datasets provide such supervision; our datasets and code are publicly available on HuggingFace; (3) offline RL training methods for learning planners with reasoning, our core algorithmic contribution, which consistently improve over the Edit-Only baseline in visual quality and instruction following; and (4) a comprehensive evaluation across 4B and 8B parameter Qwen3-VL models showing that our methods outperform other baselines on the majority of compositional tasks, validated by human evaluation.
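To make the planning format concrete, the sketch below illustrates what a tool-based plan with per-step reasoning might look like. The tool names, their arguments, and the dictionary-based image stand-in are illustrative assumptions for this sketch, not the paper's actual library or API.

```python
# Hypothetical sketch of a tool-based styling plan with explicit per-step
# reasoning. Tool names and arguments are assumptions, not the paper's API.
from dataclasses import dataclass, field

@dataclass
class PlanStep:
    reasoning: str          # explicit chain-of-thought for this step
    tool: str               # name of a primitive transformation
    args: dict = field(default_factory=dict)

# A compositional library of orthogonal primitives: each edits one
# independent aspect of the image (here mocked as a dict of attributes).
TOOLS = {
    "adjust_saturation": lambda img, factor: {**img, "saturation": img["saturation"] * factor},
    "shift_hue":         lambda img, degrees: {**img, "hue": (img["hue"] + degrees) % 360},
    "add_grain":         lambda img, amount: {**img, "grain": img.get("grain", 0.0) + amount},
}

def execute_plan(image: dict, plan: list[PlanStep]) -> dict:
    """Apply each primitive in sequence, yielding an interpretable trajectory."""
    for step in plan:
        image = TOOLS[step.tool](image, **step.args)
    return image

# A planner targeting a "vintage film look" might emit a plan like this:
plan = [
    PlanStep("Vintage looks are desaturated.", "adjust_saturation", {"factor": 0.6}),
    PlanStep("Warm the tones toward amber.",   "shift_hue",         {"degrees": 15}),
    PlanStep("Film stock implies grain.",      "add_grain",         {"amount": 0.3}),
]
styled = execute_plan({"saturation": 1.0, "hue": 30, "grain": 0.0}, plan)
print(styled)  # {'saturation': 0.6, 'hue': 45, 'grain': 0.3}
```

Because each executed plan is a discrete sequence of (reasoning, tool, args) steps, trajectories in this format can be scored for quality and reused as supervision for offline RL post-training, as the abstract describes.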