Aligning Text-to-Image Diffusion Models with Reward Backpropagation

Abstract

Text-to-image diffusion models have recently emerged at the forefront of image generation, powered by very large-scale unsupervised or weakly supervised text-to-image training datasets. Because they are trained without supervision, controlling their behavior in downstream tasks, such as maximizing human-perceived image quality, image-text alignment, or ethical image generation, is difficult. Recent works finetune diffusion models to downstream reward functions using vanilla reinforcement learning, which is notorious for the high variance of its gradient estimators. In this paper, we propose AlignProp, a method that aligns diffusion models to downstream reward functions using end-to-end backpropagation of the reward gradient through the denoising process. While a naive implementation of such backpropagation would require prohibitive memory resources for storing the partial derivatives of modern text-to-image models, AlignProp finetunes low-rank adapter weight modules and uses gradient checkpointing to render its memory usage viable. We test AlignProp in finetuning diffusion models to various objectives, such as image-text semantic alignment, aesthetics, compressibility, and controllability of the number of objects present, as well as their combinations. We show that AlignProp achieves higher rewards in fewer training steps than alternatives while being conceptually simpler, making it a straightforward choice for optimizing diffusion models for differentiable reward functions of interest. Code and visualization results are available at https://align-prop.github.io/.
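The two memory-saving ideas named in the abstract can be illustrated on a deliberately tiny problem. The sketch below is not the AlignProp implementation: it assumes a hypothetical scalar "denoiser" x_next = x - STEP * tanh((w0 + a*b) * x), where w0 stands in for frozen pretrained weights and the product a*b stands in for a trainable low-rank adapter, and a hypothetical differentiable reward -(x - target)^2. All names (`denoise_step`, `reward_grad_wrt_adapter`, `finetune`, `STEP`) are made up for illustration. The backward pass is written by hand so that the checkpointing is visible: the forward pass keeps only each step's input, and the intermediate activations are recomputed during backpropagation.

```python
import math

STEP = 0.1  # hypothetical step size of the toy denoising update


def denoise_step(x, w0, a, b):
    """One toy denoising update; w0 is frozen, a*b is the trainable adapter."""
    h = (w0 + a * b) * x  # intermediate activation, deliberately not stored
    return x - STEP * math.tanh(h)


def sample_with_checkpoints(x_T, w0, a, b, num_steps):
    """Run the full chain, keeping only each step's input as a checkpoint."""
    checkpoints = []
    x = x_T
    for _ in range(num_steps):
        checkpoints.append(x)
        x = denoise_step(x, w0, a, b)
    return x, checkpoints


def reward(x, target):
    """Hypothetical differentiable reward: higher when x is near the target."""
    return -(x - target) ** 2


def reward_grad_wrt_adapter(x_T, w0, a, b, num_steps, target):
    """Backpropagate the reward through every denoising step.

    Returns the reward and its gradient with respect to the adapter
    parameters a and b only; the frozen weight w0 receives no update.
    """
    x_final, checkpoints = sample_with_checkpoints(x_T, w0, a, b, num_steps)
    g_x = -2.0 * (x_final - target)  # d(reward)/d(final sample)
    g_a = 0.0
    g_b = 0.0
    w = w0 + a * b
    for x in reversed(checkpoints):
        # Recompute the activation from the stored step input (checkpointing).
        dtanh = 1.0 - math.tanh(w * x) ** 2
        g_w = g_x * (-STEP * dtanh * x)      # gradient reaching the weight
        g_a += g_w * b                       # chain rule through w = w0 + a*b
        g_b += g_w * a
        g_x = g_x * (1.0 - STEP * dtanh * w)  # pass gradient to the earlier step
    return reward(x_final, target), g_a, g_b


def finetune(x_T=1.0, w0=1.0, a=0.5, b=0.5, num_steps=10,
             target=0.2, lr=0.5, iters=200):
    """Gradient ascent on the reward, updating only the adapter (a, b)."""
    history = []
    for _ in range(iters):
        r, g_a, g_b = reward_grad_wrt_adapter(x_T, w0, a, b, num_steps, target)
        history.append(r)
        a += lr * g_a
        b += lr * g_b
    return a, b, history


if __name__ == "__main__":
    a, b, history = finetune()
    print("reward before:", history[0], "reward after:", history[-1])
```

In a real text-to-image model the same pattern would rely on automatic differentiation, with the framework's own checkpointing utility and adapter layers in place of the hand-written backward pass. The scalar version only shows where the stored state and the trainable parameters sit in the chain.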