From Plans to Pixels: Learning to Plan and Orchestrate for Open-Ended Image Editing

Abstract

Modern image editing models produce realistic results but struggle with abstract, multi-step instructions (e.g., "make this advertisement more vegetarian-friendly"). Prior agent-based methods decompose such tasks but rely on handcrafted pipelines or teacher imitation, which limits flexibility and decouples learning from actual editing outcomes. We propose an experiential framework for long-horizon image editing, in which a planner generates structured atomic decompositions and an orchestrator selects tools and regions to execute each step. A vision-language judge provides outcome-based rewards for instruction adherence and visual quality. The orchestrator is trained to maximize these rewards, and successful trajectories are used to refine the planner. By tightly coupling planning with reward-driven execution, our approach yields more coherent and reliable edits than single-step or rule-based multi-step baselines.
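
The control flow described in the abstract can be pictured with a minimal sketch. This is an illustration under stated assumptions, not the authors' implementation: every name below (`run_episode`, `Action`, the `toy_*` stubs, the tool names, and the region labels) is hypothetical, an "image" is modeled as a plain list of applied edits, and the judge is a placeholder rather than a vision-language model.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple

# Toy stand-in (assumption): an "image" is just a log of the edits applied to it.
Image = List[str]


@dataclass(frozen=True)
class Action:
    tool: str    # hypothetical tool name chosen by the orchestrator
    region: str  # hypothetical region label chosen by the orchestrator


def run_episode(
    instruction: str,
    image: Image,
    planner: Callable[[str], List[str]],
    orchestrator: Callable[[str, Image], Action],
    tools: Dict[str, Callable[[Image, str, str], Image]],
    judge: Callable[[str, Image], float],
) -> Tuple[Image, List[Tuple[str, Action]], float]:
    """Plan, execute each atomic step, then score the final outcome."""
    plan = planner(instruction)                   # structured atomic steps
    trajectory: List[Tuple[str, Action]] = []
    for step in plan:
        action = orchestrator(step, image)        # pick a tool and a region
        image = tools[action.tool](image, action.region, step)
        trajectory.append((step, action))
    reward = judge(instruction, image)            # outcome-based reward
    return image, trajectory, reward


# Hypothetical stubs so the sketch runs end to end.
def toy_planner(instruction: str) -> List[str]:
    return ["replace burger with salad", "recolor background green"]


def toy_orchestrator(step: str, image: Image) -> Action:
    if step.startswith("replace"):
        return Action("inpaint", "foreground")
    return Action("recolor", "background")


def apply_edit(name: str) -> Callable[[Image, str, str], Image]:
    # Returns a tool that records the edit instead of touching pixels.
    return lambda image, region, step: image + [f"{name}@{region}: {step}"]


toy_tools = {"inpaint": apply_edit("inpaint"), "recolor": apply_edit("recolor")}


def toy_judge(instruction: str, image: Image) -> float:
    # Placeholder score: 1.0 if both planned edits were applied, else 0.0.
    return 1.0 if len(image) == 2 else 0.0
```

In a training setup along the lines the abstract describes, the returned reward would drive the orchestrator's updates, and trajectories with high reward would be kept as data for refining the planner. The sketch shows only the rollout, not either learning step.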