EditScore: Unlocking Online RL for Image Editing via High-Fidelity Reward Modeling

Abstract

Instruction-guided image editing has made remarkable progress, yet current models still struggle with complex instructions and often need multiple samples to produce a desired result. Reinforcement Learning (RL) offers a promising solution, but its adoption in image editing has been severely hindered by the lack of a high-fidelity, efficient reward signal. In this work, we present a comprehensive methodology to overcome this barrier, centered on the development of a state-of-the-art, specialized reward model. We first introduce EditReward-Bench, a comprehensive benchmark for systematically evaluating reward models on editing quality. Building on this benchmark, we develop EditScore, a series of reward models (7B-72B) for evaluating the quality of instruction-guided image editing. Through meticulous data curation and filtering, EditScore effectively matches the performance of leading proprietary VLMs. Furthermore, coupled with an effective self-ensemble strategy tailored to the generative nature of EditScore, our largest variant even surpasses GPT-5 on the benchmark. We then demonstrate that a high-fidelity reward model is the key to unlocking online RL for image editing. Our experiments show that, while even the largest open-source VLMs fail to provide an effective learning signal, EditScore enables efficient and robust policy optimization. Applying our framework to a strong base model, OmniGen2, yields a final model with a substantial and consistent performance uplift. Overall, this work provides the first systematic path from benchmarking to reward modeling to RL training in image editing, showing that a high-fidelity, domain-specialized reward model is the key to unlocking the full potential of RL in this domain.
