The Promise of RL for Autoregressive Image Editing

Abstract

We explore three strategies to enhance performance on a wide range of image editing tasks: supervised fine-tuning (SFT), reinforcement learning (RL), and Chain-of-Thought (CoT) reasoning. To study all of these components in one consistent framework, we adopt an autoregressive multimodal model that processes textual and visual tokens in a unified manner. We find RL combined with a large multimodal LLM verifier to be the most effective of these strategies. As a result, we release EARL (Editing with Autoregression and RL), a strong RL-based image editing model that performs competitively with strong baselines across a diverse range of edits, despite using much less training data. Thus, EARL pushes the frontier of autoregressive multimodal models on image editing. We release our code, training data, and trained models at https://github.com/mair-lab/EARL.
