Long Context Tuning for Video Generation

Abstract

Recent advances in video generation can produce realistic, minute-long single-shot videos with scalable diffusion transformers. However, real-world narrative videos require multi-shot scenes with visual and dynamic consistency across shots. In this work, we introduce Long Context Tuning (LCT), a training paradigm that expands the context window of pre-trained single-shot video diffusion models so that they learn scene-level consistency directly from data. Our method extends full attention from individual shots to all shots within a scene and incorporates interleaved 3D position embedding and an asynchronous noise strategy, enabling both joint and auto-regressive shot generation without additional parameters. Models with bidirectional attention after LCT can be further fine-tuned with context-causal attention, which facilitates auto-regressive generation with an efficient KV cache. Experiments demonstrate that single-shot models after LCT can produce coherent multi-shot scenes and exhibit emerging capabilities, including compositional generation and interactive shot extension, paving the way for more practical visual content creation. See https://guoyww.github.io/projects/long-context-video/ for more details.
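The two mechanisms named above, context-causal attention and asynchronous noise, can be illustrated with a minimal sketch. This is not the authors' implementation. It assumes that "context-causal" means a token attends bidirectionally within its own shot and to all tokens of earlier shots, and that "asynchronous" means each shot draws its own diffusion timestep. The function names `context_causal_mask` and `sample_shot_timesteps` and the default `num_steps=1000` are hypothetical and chosen only for illustration.

```python
import random


def context_causal_mask(shot_lengths):
    """Return mask[q][k], True if query token q may attend to key token k.

    Assumption: attention is bidirectional inside a shot and causal across
    shots, so a token sees its own shot and every earlier shot, never a later one.
    """
    # Shot index of every token in the concatenated scene sequence.
    shot_of = [s for s, n in enumerate(shot_lengths) for _ in range(n)]
    total = len(shot_of)
    return [[shot_of[k] <= shot_of[q] for k in range(total)] for q in range(total)]


def sample_shot_timesteps(num_shots, num_steps=1000, rng=random):
    """Draw an independent diffusion timestep per shot (assumed reading of
    'asynchronous noise'); num_steps is a hypothetical schedule length."""
    return [rng.randrange(num_steps) for _ in range(num_shots)]


# Toy scene: shot 0 has 2 tokens, shot 1 has 3 tokens.
mask = context_causal_mask([2, 3])
timesteps = sample_shot_timesteps(2, rng=random.Random(0))
```

Under these assumptions, keys and values of earlier shots never depend on later shots, which is what would let them be cached and reused when a new shot is generated. Giving a shot a timestep near zero would treat it as nearly clean conditioning for the other shots.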