DPOK: Reinforcement Learning for Fine-tuning Text-to-Image Diffusion Models

Abstract

Learning from human feedback has been shown to improve text-to-image models. These techniques first learn a reward function that captures what humans care about in the task and then improve the models based on the learned reward function. Although relatively simple approaches (e.g., rejection sampling based on reward scores) have been investigated, fine-tuning text-to-image models with the reward function remains challenging. In this work, we propose using online reinforcement learning (RL) to fine-tune text-to-image models. We focus on diffusion models, defining the fine-tuning task as an RL problem and updating the pre-trained text-to-image diffusion models with policy gradient to maximize the feedback-trained reward. Our approach, coined DPOK, integrates policy optimization with KL regularization. We conduct an analysis of KL regularization for both RL fine-tuning and supervised fine-tuning. In our experiments, we show that DPOK is generally superior to supervised fine-tuning with respect to both image-text alignment and image quality.
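
As a rough illustration of what combining policy optimization with KL regularization can mean, the toy sketch below replaces the diffusion model with a three-outcome softmax policy and the learned reward with a fixed score table. It is not the DPOK algorithm itself: it runs gradient ascent on expected reward minus beta times KL(policy || pre-trained policy), using exact expectations instead of sampled denoising trajectories. All names (`fine_tune`, `beta`, `rewards`, `ref_logits`) and numeric values are assumptions made for illustration.

```python
import math


def log_softmax(logits):
    """Numerically stable log-probabilities of a softmax policy."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(v - m) for v in logits))
    return [v - log_z for v in logits]


def kl_divergence(log_p, log_q):
    """KL(p || q) for two categorical distributions given as log-probabilities."""
    return sum(math.exp(lp) * (lp - lq) for lp, lq in zip(log_p, log_q))


def expected_reward(log_p, rewards):
    """Expected reward of a categorical policy given as log-probabilities."""
    return sum(math.exp(lp) * r for lp, r in zip(log_p, rewards))


def fine_tune(ref_logits, rewards, beta, lr=0.5, steps=500):
    """Gradient ascent on E_pi[reward] - beta * KL(pi || pi_ref).

    Toy stand-in (assumption): `ref_logits` plays the pre-trained model and
    `rewards` the learned reward. Returns the fine-tuned log-probabilities.
    """
    log_ref = log_softmax(ref_logits)
    theta = list(ref_logits)  # start from the pre-trained policy
    for _ in range(steps):
        log_pi = log_softmax(theta)
        pi = [math.exp(lp) for lp in log_pi]
        # Reward minus a per-outcome KL penalty.
        penalized = [r - beta * (lp - lq)
                     for r, lp, lq in zip(rewards, log_pi, log_ref)]
        baseline = sum(p * a for p, a in zip(pi, penalized))
        # Exact policy gradient with respect to the softmax logits.
        grad = [p * (a - baseline) for p, a in zip(pi, penalized)]
        theta = [t + lr * g for t, g in zip(theta, grad)]
    return log_softmax(theta)


if __name__ == "__main__":
    ref_logits = [0.0, 0.0, 0.0]   # uniform "pre-trained" policy
    rewards = [0.1, 0.5, 1.0]      # hypothetical reward scores
    for beta in (0.0, 1.0):
        log_pi = fine_tune(ref_logits, rewards, beta)
        print(beta, [round(math.exp(lp), 3) for lp in log_pi])
```

With `beta = 0` the toy policy concentrates on the highest-reward outcome and drifts far from the reference policy. A positive `beta` keeps it close to the reference, and in this simplified setting the optimum is proportional to the reference probability times `exp(reward / beta)`.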

