RL for Consistency Models: Faster Reward Guided Text-to-Image Generation

Abstract

Reinforcement learning (RL) has improved guided image generation with diffusion models by directly optimizing rewards that capture image quality, aesthetics, and instruction-following capabilities. However, the resulting generative policies inherit the iterative sampling process of diffusion models, which makes generation slow. To overcome this limitation, consistency models were proposed as a new class of generative models that map noise directly to data, yielding a model that can generate an image in as few as one sampling iteration. In this work, to optimize text-to-image generative models for task-specific rewards and enable fast training and inference, we propose a framework for fine-tuning consistency models via RL. Our framework, called Reinforcement Learning for Consistency Model (RLCM), frames the iterative inference process of a consistency model as an RL procedure. RLCM improves upon RL fine-tuned diffusion models in text-to-image generation capabilities and trades computation at inference time for sample quality. Experimentally, we show that RLCM can adapt text-to-image consistency models to objectives that are challenging to express with prompting, such as image compressibility, and to those derived from human feedback, such as aesthetic quality. Compared to RL fine-tuned diffusion models, RLCM trains significantly faster, improves the quality of the generation as measured under the reward objectives, and speeds up inference by generating high-quality images with as few as two inference steps. Our code is available at https://rlcm.owenoertell.com.
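
The abstract does not spell out how the inference process becomes an RL procedure, so the following is only a minimal sketch of one plausible reading, built on standard multistep consistency sampling: each re-noising step is treated as a Gaussian action whose log-probability can be recorded, and a reward is computed on the final sample. Everything named here is an assumption for illustration: `toy_consistency_fn` is a hypothetical stand-in for a learned consistency function, `toy_reward` is a hypothetical reward, and the noise levels and `EPS` are arbitrary choices, not values from the paper.

```python
import numpy as np

EPS = 0.002  # assumed minimum noise level (illustrative value)


def toy_consistency_fn(x, t, theta):
    """Hypothetical stand-in for a learned consistency function f_theta(x, t).

    It pulls x toward theta more strongly at high noise levels and is
    close to the identity when t is near zero.
    """
    w = 1.0 / (1.0 + t ** 2)
    return w * x + (1.0 - w) * theta


def gaussian_log_prob(x, mean, std):
    """Log-density of an isotropic Gaussian N(mean, std^2 I) evaluated at x."""
    d = x.size
    return float(-0.5 * np.sum(((x - mean) / std) ** 2)
                 - d * np.log(std) - 0.5 * d * np.log(2.0 * np.pi))


def rollout(theta, taus, rng):
    """Multistep consistency sampling, recording per-step log-probabilities.

    taus is a decreasing list of noise levels; len(taus) is the number of
    function evaluations (inference steps).
    """
    tau = taus[0]
    x_noisy = tau * rng.standard_normal(theta.shape)  # initial noise
    log_probs = []
    for tau_next in taus[1:]:
        mean = toy_consistency_fn(x_noisy, tau, theta)  # denoised estimate
        std = np.sqrt(tau_next ** 2 - EPS ** 2)         # re-noising scale
        x_noisy = mean + std * rng.standard_normal(theta.shape)  # the "action"
        log_probs.append(gaussian_log_prob(x_noisy, mean, std))
        tau = tau_next
    sample = toy_consistency_fn(x_noisy, tau, theta)    # final deterministic step
    return sample, log_probs


def toy_reward(sample, target):
    """Hypothetical reward: negative squared distance to a target point."""
    return -float(np.sum((sample - target) ** 2))


rng = np.random.default_rng(0)
theta = np.array([1.0, -1.0])
target = np.array([1.0, -1.0])

# Two inference steps: one stochastic transition plus the final evaluation.
sample, log_probs = rollout(theta, [80.0, 1.0], rng)
reward = toy_reward(sample, target)
print("sample:", sample, "log-probs:", log_probs, "reward:", reward)
```

In a policy-gradient setup, the recorded log-probabilities would be weighted by the terminal reward to update the model parameters. Passing a longer `taus` list adds function evaluations, which is the inference-time compute that the abstract describes as being traded for sample quality.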
