EditReward: A Human-Aligned Reward Model for Instruction-Guided Image Editing

Abstract

Image editing with natural language instructions has advanced rapidly in recent years. Several closed-source models, such as GPT-Image-1, Seedream, and Google-Nano-Banana, have shown highly promising results, but open-source models still lag behind. The main bottleneck is the lack of a reliable reward model for scaling up high-quality synthetic training data. To address this bottleneck, we built EditReward, trained on our new large-scale human preference dataset, which contains over 200K preference pairs meticulously annotated by trained experts following a rigorous protocol. EditReward demonstrates superior alignment with human preferences in instruction-guided image editing tasks. Experiments show that EditReward achieves state-of-the-art human correlation on established benchmarks such as GenAI-Bench, AURORA-Bench, and ImagenHub, as well as on our new \benchname, outperforming a wide range of VLM-as-judge models. Furthermore, we use EditReward to select a high-quality subset from the existing noisy ShareGPT-4o-Image dataset. Step1X-Edit trained on the selected subset improves significantly over the same model trained on the full set, which demonstrates EditReward's ability to serve as a reward model for scaling up high-quality training data for image editing. Its strong alignment also suggests potential for advanced applications such as reinforcement learning-based post-training and test-time scaling of image editing models. EditReward and its training dataset will be released to help the community build more high-quality image editing training datasets.
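
The subset-selection step mentioned in the abstract can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' pipeline: the abstract does not give the selection rule or the fraction kept, so `select_top_fraction`, `score_fn`, and `keep_fraction` are hypothetical names, and `score_fn` stands in for any reward model that maps an editing example to a scalar score.

```python
from typing import Callable, List, Sequence, Tuple

# One editing example: (source image path, edit instruction, edited image path).
EditSample = Tuple[str, str, str]


def select_top_fraction(
    samples: Sequence[EditSample],
    score_fn: Callable[[EditSample], float],  # hypothetical reward-model wrapper
    keep_fraction: float,                     # hypothetical; not given in the abstract
) -> List[EditSample]:
    """Keep the highest-scoring fraction of samples, preserving dataset order."""
    if not 0.0 < keep_fraction <= 1.0:
        raise ValueError("keep_fraction must be in (0, 1]")
    if not samples:
        return []
    scores = [score_fn(s) for s in samples]
    n_keep = max(1, int(len(samples) * keep_fraction))
    # Rank indices by descending score; ties fall back to original position.
    ranked = sorted(range(len(samples)), key=lambda i: (-scores[i], i))
    kept = sorted(ranked[:n_keep])
    return [samples[i] for i in kept]


if __name__ == "__main__":
    # Toy data with made-up scores, only to show the call pattern.
    toy = [
        ("a.png", "add a hat", "a_edit.png"),
        ("b.png", "remove the car", "b_edit.png"),
        ("c.png", "make it night", "c_edit.png"),
        ("d.png", "turn the door red", "d_edit.png"),
    ]
    toy_scores = {"a_edit.png": 0.2, "b_edit.png": 0.9, "c_edit.png": 0.5, "d_edit.png": 0.7}
    subset = select_top_fraction(toy, lambda s: toy_scores[s[2]], keep_fraction=0.5)
    print([s[2] for s in subset])
```

Ranking by score and keeping a fixed fraction is only one possible rule; an absolute score threshold would be an equally plausible reading of "select a high-quality subset".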
