Video Diffusion Alignment via Reward Gradients

Abstract

We have made significant progress towards building foundational video diffusion models. As these models are trained using large-scale unsupervised data, it has become crucial to adapt them to specific downstream tasks. Adapting these models via supervised fine-tuning requires collecting target datasets of videos, which is challenging and tedious. In this work, we utilize pre-trained reward models, learned via preferences on top of powerful vision discriminative models, to adapt video diffusion models. These reward models contain dense gradient information with respect to generated RGB pixels, which is critical to efficient learning in complex search spaces such as videos. We show that backpropagating gradients from these reward models to a video diffusion model allows for compute- and sample-efficient alignment of the video diffusion model. We show results across a variety of reward models and video diffusion models, demonstrating that our approach can learn much more efficiently in terms of reward queries and computation than prior gradient-free approaches. Our code, model weights, and more visualizations are available at https://vader-vid.github.io.
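
The core idea in the abstract, backpropagating a differentiable reward through the sampling chain into the generator's parameters, can be illustrated with a toy sketch. This is not the authors' implementation. The linear "denoising" step, the quadratic reward, and every name below (`sample`, `reward`, `reward_grad_wrt_theta`, `align`) are hypothetical stand-ins chosen so the example runs with the Python standard library alone; a real setup would use a video diffusion model, a learned reward model, and an autodiff framework.

```python
import random


def sample(theta, noise, steps=4, step_size=0.5):
    """Toy deterministic 'denoising' chain: each step pulls x toward theta."""
    x = list(noise)
    for _ in range(steps):
        x = [xi + step_size * (ti - xi) for xi, ti in zip(x, theta)]
    return x


def reward(x, target):
    """Hypothetical differentiable reward: negative squared distance to a target."""
    return -sum((xi - ti) ** 2 for xi, ti in zip(x, target))


def reward_grad_wrt_theta(theta, noise, target, steps=4, step_size=0.5):
    """Backpropagate d(reward)/d(sample) through every step of the chain."""
    x = sample(theta, noise, steps, step_size)
    # Dense gradient of the reward with respect to the generated sample.
    g_x = [-2.0 * (xi - ti) for xi, ti in zip(x, target)]
    g_theta = [0.0] * len(theta)
    # Reverse pass. Each forward step is x <- (1 - s) * x + s * theta, so
    # d(x_next)/d(theta) = s and d(x_next)/d(x) = (1 - s) per coordinate.
    for _ in range(steps):
        g_theta = [gt + step_size * gx for gt, gx in zip(g_theta, g_x)]
        g_x = [(1.0 - step_size) * gx for gx in g_x]
    return g_theta


def align(target, dim=3, iters=200, lr=0.05, seed=0):
    """Gradient ascent on the reward, one fresh noise sample per update."""
    rng = random.Random(seed)
    theta = [0.0] * dim
    for _ in range(iters):
        noise = [rng.gauss(0.0, 1.0) for _ in range(dim)]
        grad = reward_grad_wrt_theta(theta, noise, target)
        theta = [t + lr * g for t, g in zip(theta, grad)]
    return theta
```

Each update here uses one reward evaluation and yields a full gradient over all parameters. A gradient-free method, by contrast, only sees the scalar reward value for each query, which is the efficiency gap the abstract refers to.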
