From Plans to Pixels: Learning to Plan and Orchestrate for Open-Ended Image Editing

Abstract

Modern image editing models produce realistic results but struggle with abstract, multi-step instructions (e.g., "make this advertisement more vegetarian-friendly"). Prior agent-based methods decompose such tasks but rely on handcrafted pipelines or teacher imitation, which limits flexibility and decouples learning from actual editing outcomes. We propose an experiential framework for long-horizon image editing, in which a planner generates structured atomic decompositions and an orchestrator selects the tools and regions used to execute each step. A vision-language judge provides outcome-based rewards for instruction adherence and visual quality. The orchestrator is trained to maximize these rewards, and successful trajectories are used to refine the planner. By tightly coupling planning with reward-driven execution, our approach yields more coherent and reliable edits than single-step or rule-based multi-step baselines.
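
The plan, orchestrate, judge, and refine loop described above can be illustrated with a toy sketch. Everything in it is an assumption made for illustration and is not taken from the paper: the tool and region names, the fixed two-step plan, the stub judge that compares choices against a hidden target instead of inspecting an image, and the "training" procedure, which simply tries every tool/region assignment once and averages the trajectory-level reward per choice. That exhaustive averaging is a stand-in for the reward-maximizing training the abstract refers to, not a description of it.

```python
from collections import defaultdict
from itertools import product

# Hypothetical tool and region vocabularies (illustrative, not from the paper).
TOOLS = ["inpaint", "text_edit"]
REGIONS = ["food", "caption"]
CHOICES = list(product(TOOLS, REGIONS))

# Hypothetical atomic steps and the (tool, region) the stub judge favors.
TARGET = {
    "swap the meat patty for a veggie patty": ("inpaint", "food"),
    "rewrite the slogan": ("text_edit", "caption"),
}


def plan(instruction):
    """Stub planner: returns the same atomic decomposition for any input."""
    return list(TARGET)


def judge(steps, actions):
    """Stub judge: one outcome reward in [0, 1] for the whole trajectory.

    A real judge would score the edited image; this one returns the
    fraction of steps whose (tool, region) matches the hidden target.
    """
    hits = sum(1 for s, a in zip(steps, actions) if TARGET[s] == a)
    return hits / len(steps)


def train_orchestrator(steps, threshold=1.0):
    """Average the trajectory reward seen by each (step, tool, region).

    Every joint assignment is tried once. Trajectories whose reward
    reaches the threshold are kept as data for refining the planner.
    """
    totals, counts = defaultdict(float), defaultdict(int)
    successes = []
    for actions in product(CHOICES, repeat=len(steps)):
        reward = judge(steps, actions)
        for step, action in zip(steps, actions):
            totals[(step, action)] += reward
            counts[(step, action)] += 1
        if reward >= threshold:
            successes.append(list(zip(steps, actions)))
    values = {key: totals[key] / counts[key] for key in totals}
    return values, successes


def orchestrate(steps, values):
    """Pick, for each step, the choice with the highest average reward."""
    return [max(CHOICES, key=lambda a: values[(s, a)]) for s in steps]


steps = plan("make this advertisement more vegetarian-friendly")
values, successes = train_orchestrator(steps)
policy = orchestrate(steps, values)
print(policy)
print(len(successes), "successful trajectory kept for planner refinement")
```

In this sketch the reward is assigned only to the complete trajectory, yet the per-choice averages still separate good from bad tool/region selections, because a correct choice at one step raises the reward regardless of what happens at the other step. A learned orchestrator would replace the lookup table, and the kept trajectories would serve as training data for the planner.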