Probabilistic Adaptation of Text-to-Video Models

Abstract

Large text-to-video models trained on internet-scale data have demonstrated exceptional capabilities in generating high-fidelity videos from arbitrary textual descriptions. However, adapting these models to tasks with limited domain-specific data, such as animation or robotics videos, poses a significant computational challenge, since finetuning a large pretrained model can be prohibitively expensive. Inspired by how a small modifiable component (e.g., prompts, prefix-tuning) can adapt a large language model to perform new tasks without requiring access to the model weights, we investigate how to adapt a large pretrained text-to-video model to a variety of downstream domains and tasks without finetuning. To answer this question, we propose Video Adapter, which leverages the score function of a large pretrained video diffusion model as a probabilistic prior to guide the generation of a task-specific small video model. Our experiments show that Video Adapter incorporates the broad knowledge and preserves the high fidelity of a large pretrained video model in a task-specific small video model, which can then generate high-quality yet specialized videos on a variety of tasks such as animation, egocentric modeling, and modeling of simulated and real-world robotics data. More videos can be found at https://video-adapter.github.io/.
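As a rough illustration of what using a score function as a probabilistic prior can mean, the sketch below assumes the two models are combined as a product of densities, so that their scores (gradients of the log density) simply add. The abstract does not state the composition rule, so this is an assumption for illustration and not the Video Adapter implementation. The 1-D Gaussians stand in for the large pretrained model and the small task-specific model, and the names gaussian_score, composed_score, product_gaussian, and prior_weight are all hypothetical.

```python
import math


def gaussian_score(x, mean, std):
    """Score (d/dx of the log density) of a 1-D Gaussian N(mean, std^2)."""
    return -(x - mean) / std**2


def composed_score(x, prior, task, prior_weight=1.0):
    """Sum of the two scores, i.e. the score of prior^prior_weight * task.

    ASSUMPTION: `prior` and `task` are (mean, std) pairs for toy Gaussians
    standing in for the large pretrained model and the small task model.
    """
    return prior_weight * gaussian_score(x, *prior) + gaussian_score(x, *task)


def product_gaussian(mean_a, std_a, mean_b, std_b):
    """Mean and std of the normalized product of two 1-D Gaussians."""
    precision = 1.0 / std_a**2 + 1.0 / std_b**2
    mean = (mean_a / std_a**2 + mean_b / std_b**2) / precision
    return mean, 1.0 / math.sqrt(precision)


# Toy stand-ins: a broad prior and a narrow task-specific model.
prior = (0.0, 2.0)
task = (1.0, 0.5)
mean, std = product_gaussian(*prior, *task)
for x in (-1.0, 0.0, 2.0):
    # The composed score and the product Gaussian's score should agree.
    print(x, composed_score(x, prior, task), gaussian_score(x, mean, std))
```

In this toy setting the summed score is exactly the score of the product distribution, which sits close to the narrow task model while still being pulled toward the broad prior. A sampler that follows the summed score would therefore draw from a distribution shaped by both models without modifying either one's parameters.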
