From Plans to Pixels: Learning to Plan and Orchestrate for Open-Ended Image Editing

Abstract

Modern image editing models produce realistic results but struggle with abstract, multi-step instructions (e.g., "make this advertisement more vegetarian-friendly"). Prior agent-based methods decompose such tasks but rely on handcrafted pipelines or teacher imitation, which limits flexibility and decouples learning from actual editing outcomes. We propose an experiential framework for long-horizon image editing in which a planner generates structured atomic decompositions and an orchestrator selects the tools and regions used to execute each step. A vision-language judge provides outcome-based rewards for instruction adherence and visual quality. The orchestrator is trained to maximize these rewards, and successful trajectories are used to refine the planner. By tightly coupling planning with reward-driven execution, our approach yields more coherent and reliable edits than single-step or rule-based multi-step baselines.
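The planner, orchestrator, and judge loop described above can be illustrated with a toy sketch. Everything in it is a hypothetical stand-in rather than the authors' implementation: the names `plan`, `orchestrate`, `judge`, `run_episode`, and `collect_successes`, the two tools, the region labels, and the scoring rule are all assumptions. An "image" is modeled as a plain log of applied edits, and the orchestrator samples at random instead of being trained, so the sketch shows only the data flow and the filtering of successful trajectories.

```python
import random
from dataclasses import dataclass, field
from typing import Callable, Dict, List

# Stand-in for pixels: an "image" is just the log of edits applied so far.
Image = List[str]


@dataclass
class Step:
    instruction: str  # one atomic edit produced by the planner


@dataclass
class Action:
    tool: str    # which editing tool the orchestrator picked
    region: str  # which image region it targets


@dataclass
class Trajectory:
    goal: str
    steps: List[Step]
    actions: List[Action] = field(default_factory=list)
    reward: float = 0.0


def _make_tool(name: str) -> Callable[[Image, Step, str], Image]:
    # Hypothetical tool: records what it would have done instead of editing pixels.
    def run(image: Image, step: Step, region: str) -> Image:
        return image + [f"{name}@{region}: {step.instruction}"]
    return run


TOOLS: Dict[str, Callable[[Image, Step, str], Image]] = {
    name: _make_tool(name) for name in ("inpaint", "text_edit")
}
REGIONS = ("foreground", "background", "caption")


def plan(goal: str) -> List[Step]:
    # Hypothetical planner: a fixed decomposition of the abstract goal.
    return [
        Step("replace the meat dish with a vegetable dish"),
        Step("rewrite the caption text"),
    ]


def orchestrate(step: Step, rng: random.Random) -> Action:
    # Untrained orchestrator: samples a tool and a region uniformly at random.
    return Action(tool=rng.choice(sorted(TOOLS)), region=rng.choice(REGIONS))


def judge(goal: str, image: Image) -> float:
    # Toy outcome reward in [0, 1]: fraction of edits made with a suitable tool.
    # A real judge would be a vision-language model scoring the edited image.
    if not image:
        return 0.0
    hits = 0
    for entry in image:
        header, instruction = entry.split(": ", 1)
        tool = header.split("@", 1)[0]
        wants_text = "text" in instruction
        hits += (tool == "text_edit") == wants_text
    return hits / len(image)


def run_episode(goal: str, rng: random.Random) -> Trajectory:
    traj = Trajectory(goal=goal, steps=plan(goal))
    image: Image = []
    for step in traj.steps:
        action = orchestrate(step, rng)
        image = TOOLS[action.tool](image, step, action.region)
        traj.actions.append(action)
    traj.reward = judge(goal, image)  # reward depends on the final outcome only
    return traj


def collect_successes(goal: str, episodes: int, threshold: float,
                      seed: int = 0) -> List[Trajectory]:
    # Keep only high-reward trajectories; these would be used to refine the planner.
    rng = random.Random(seed)
    rollouts = (run_episode(goal, rng) for _ in range(episodes))
    return [t for t in rollouts if t.reward >= threshold]
```

In this sketch the reward is computed once per trajectory from the final result, mirroring the outcome-based reward in the abstract; a training loop would update the orchestrator from `traj.reward` rather than merely filtering rollouts.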