

DPOK: Reinforcement Learning for Fine-tuning Text-to-Image Diffusion Models

May 25, 2023
Authors: Ying Fan, Olivia Watkins, Yuqing Du, Hao Liu, Moonkyung Ryu, Craig Boutilier, Pieter Abbeel, Mohammad Ghavamzadeh, Kangwook Lee, Kimin Lee
cs.AI

Abstract

Learning from human feedback has been shown to improve text-to-image models. These techniques first learn a reward function that captures what humans care about in the task and then improve the models based on the learned reward function. Even though relatively simple approaches (e.g., rejection sampling based on reward scores) have been investigated, fine-tuning text-to-image models with the reward function remains challenging. In this work, we propose using online reinforcement learning (RL) to fine-tune text-to-image models. We focus on diffusion models, defining the fine-tuning task as an RL problem, and updating the pre-trained text-to-image diffusion models using policy gradient to maximize the feedback-trained reward. Our approach, coined DPOK, integrates policy optimization with KL regularization. We conduct an analysis of KL regularization for both RL fine-tuning and supervised fine-tuning. In our experiments, we show that DPOK is generally superior to supervised fine-tuning with respect to both image-text alignment and image quality.
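As a rough sketch of the approach described in the abstract (illustrative notation, not necessarily the paper's exact formulation), the KL-regularized RL fine-tuning objective maximizes the learned reward on generated images while penalizing divergence from the pretrained diffusion model:

\max_{\theta}\; \mathbb{E}_{z \sim p(z)}\Big[\, \alpha\, \mathbb{E}_{x_0 \sim p_\theta(x_0 \mid z)}\big[r(x_0, z)\big] \;-\; \beta\, \mathrm{KL}\big(p_\theta(x_0 \mid z)\,\|\, p_{\text{pre}}(x_0 \mid z)\big) \Big],

where z is the text prompt, r is the reward model learned from human feedback, p_\theta is the diffusion model being fine-tuned, p_{\text{pre}} is the pretrained model, and \alpha, \beta are weights (introduced here for illustration) on the reward and regularization terms.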