The Promise of RL for Autoregressive Image Editing
August 1, 2025
Authors: Saba Ahmadi, Rabiul Awal, Ankur Sikarwar, Amirhossein Kazemnejad, Ge Ya Luo, Juan A. Rodriguez, Sai Rajeswar, Siva Reddy, Christopher Pal, Benno Krojer, Aishwarya Agrawal
cs.AI
Abstract
We explore three strategies to enhance performance on a wide range of image
editing tasks: supervised fine-tuning (SFT), reinforcement learning (RL), and
Chain-of-Thought (CoT) reasoning. In order to study all these components in one
consistent framework, we adopt an autoregressive multimodal model that
processes textual and visual tokens in a unified manner. We find RL combined
with a large multimodal LLM verifier to be the most effective of these
strategies. As a result, we release EARL: Editing with Autoregression and RL, a
strong RL-based image editing model that performs competitively on a diverse
range of edits compared to strong baselines, despite using much less training
data. Thus, EARL pushes the frontier of autoregressive multimodal models on
image editing. We release our code, training data, and trained models at
https://github.com/mair-lab/EARL.