From Plans to Pixels: Learning to Plan and Orchestrate for Open-Ended Image Editing

Abstract

Modern image editing models produce realistic results but struggle with abstract, multi-step instructions (e.g., "make this advertisement more vegetarian-friendly"). Prior agent-based methods decompose such tasks but rely on handcrafted pipelines or imitation of a teacher model, which limits flexibility and decouples learning from actual editing outcomes. We propose an experiential framework for long-horizon image editing, in which a planner generates a structured decomposition into atomic steps and an orchestrator selects the tool and region used to execute each step. A vision-language judge provides outcome-based rewards for instruction adherence and visual quality. The orchestrator is trained to maximize these rewards, and successful trajectories are used to refine the planner. By tightly coupling planning with reward-driven execution, our approach yields more coherent and reliable edits than single-step or rule-based multi-step baselines.
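The planner, orchestrator, and judge loop described above can be illustrated with a minimal Python sketch. The abstract does not specify any interfaces, tools, or training algorithm, so every name here (`run_episode`, the toy planner, orchestrator, tools, and judge, and `REWARD_THRESHOLD`) is a hypothetical stand-in, and the "image" is just a list of applied edits. The sketch shows only the control flow: plan, execute each step with a chosen tool and region, score the final result, and keep the trajectory if it succeeded.

```python
from typing import Callable, Dict, List, Tuple

Image = List[str]                    # toy stand-in: an "image" is a log of applied edits
Step = str                           # one atomic edit instruction
Record = Tuple[str, str, str]        # (step, tool name, region label)


def run_episode(
    image: Image,
    instruction: str,
    planner: Callable[[Image, str], List[Step]],
    orchestrator: Callable[[Image, Step], Tuple[str, str]],
    tools: Dict[str, Callable[[Image, Step, str], Image]],
    judge: Callable[[Image, str], float],
) -> Tuple[Image, List[Record], float]:
    """Plan, execute each step with a chosen tool and region, then score the result."""
    plan = planner(image, instruction)
    trajectory: List[Record] = []
    for step in plan:
        tool_name, region = orchestrator(image, step)
        image = tools[tool_name](image, step, region)
        trajectory.append((step, tool_name, region))
    # Outcome-based reward: only the final image is scored against the instruction.
    reward = judge(image, instruction)
    return image, trajectory, reward


# --- Hypothetical toy components, for illustration only ---
def toy_planner(image: Image, instruction: str) -> List[Step]:
    return ["swap meat for vegetables", "recolor background green"]


def toy_orchestrator(image: Image, step: Step) -> Tuple[str, str]:
    if step.startswith("swap"):
        return ("inpaint", "food")
    return ("recolor", "background")


def make_tool(name: str) -> Callable[[Image, Step, str], Image]:
    def apply(image: Image, step: Step, region: str) -> Image:
        return image + [f"{name}:{region}"]   # return a new "image", leave the input intact
    return apply


TOOLS = {"inpaint": make_tool("inpaint"), "recolor": make_tool("recolor")}


def toy_judge(image: Image, instruction: str) -> float:
    # Reward in [0, 1]: fraction of the expected edits present in the final image.
    expected = {"inpaint:food", "recolor:background"}
    return len(expected & set(image)) / len(expected)


final_image, trajectory, reward = run_episode(
    [], "make this advertisement more vegetarian-friendly",
    toy_planner, toy_orchestrator, TOOLS, toy_judge,
)

# Assumed success criterion: trajectories above a threshold are kept as planner training data.
REWARD_THRESHOLD = 0.5
planner_training_data = [trajectory] if reward >= REWARD_THRESHOLD else []
```

The sketch leaves out the actual learning updates. In the framework described above, the reward would drive orchestrator training and the retained trajectories would be used to refine the planner.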