Training Diffusion Models with Reinforcement Learning

Abstract

Diffusion models are a class of flexible generative models trained with an approximation to the log-likelihood objective. However, most use cases of diffusion models are not concerned with likelihoods, but instead with downstream objectives such as human-perceived image quality or drug effectiveness. In this paper, we investigate reinforcement learning methods for directly optimizing diffusion models for such objectives. We describe how posing denoising as a multi-step decision-making problem enables a class of policy gradient algorithms, which we refer to as denoising diffusion policy optimization (DDPO), that are more effective than alternative reward-weighted likelihood approaches. Empirically, DDPO is able to adapt text-to-image diffusion models to objectives that are difficult to express via prompting, such as image compressibility, and those derived from human feedback, such as aesthetic quality. Finally, we show that DDPO can improve prompt-image alignment using feedback from a vision-language model without the need for additional data collection or human annotation.
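
To make the idea of treating denoising as a multi-step decision-making problem concrete, here is a minimal toy sketch, not the paper's implementation. It assumes a hypothetical 1-D "denoiser" with a single parameter `theta`, where each denoising step samples x_{t-1} ~ N(x_t + theta, sigma^2) and a reward is given only on the final sample x_0. The update is a plain score-function (REINFORCE-style) policy gradient with a batch-mean baseline; the function names, reward, and hyperparameters are all illustrative assumptions.

```python
import random


def sample_trajectory(theta, num_steps, sigma, rng):
    """Run a toy 1-D denoising chain x_T -> ... -> x_0.

    Each step is treated as one action of a Gaussian policy:
    x_{t-1} ~ N(x_t + theta, sigma^2).
    Returns x_0 and the gradient of the trajectory log-probability
    with respect to theta, summed over all steps.
    """
    x = rng.gauss(0.0, 1.0)  # x_T: start from pure noise
    grad_logp = 0.0
    for _ in range(num_steps):
        mean = x + theta
        x_next = rng.gauss(mean, sigma)
        # d/dtheta log N(x_next; mean, sigma^2) = (x_next - mean) / sigma^2
        grad_logp += (x_next - mean) / sigma**2
        x = x_next
    return x, grad_logp


def reward(x0, target=2.0):
    """Hypothetical downstream objective: final sample should be near target."""
    return -(x0 - target) ** 2


def train(num_iters=200, batch_size=64, num_steps=5, sigma=0.5,
          lr=0.002, seed=0):
    """Score-function policy gradient on the toy denoising chain."""
    rng = random.Random(seed)
    theta = 0.0
    history = []  # mean reward per iteration
    for _ in range(num_iters):
        batch = [sample_trajectory(theta, num_steps, sigma, rng)
                 for _ in range(batch_size)]
        rewards = [reward(x0) for x0, _ in batch]
        baseline = sum(rewards) / batch_size  # variance-reduction baseline
        # Gradient estimate: average of (reward - baseline) * grad log-prob.
        grad = sum((r - baseline) * g
                   for r, (_, g) in zip(rewards, batch)) / batch_size
        theta += lr * grad  # gradient ascent on expected reward
        history.append(baseline)
    return theta, history


if __name__ == "__main__":
    theta, history = train()
    print("learned theta:", theta)
    print("mean reward, first vs last iteration:", history[0], history[-1])
```

In this toy setup the expected endpoint of the chain is 5 * theta, so the reward is maximized in expectation at theta = 0.4, and the learned value should drift toward it. The reward is never differentiated; only log-probabilities of the sampled denoising steps are, which is what allows this kind of update to target objectives that are black-box functions of the final sample.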
