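First Frame Is the Place to Go for Video Content Customization

Abstract

What role does the first frame play in video generation models? It has traditionally been viewed as the spatiotemporal starting point of a video, merely a seed for subsequent animation. In this work, we present a fundamentally different perspective: video models implicitly treat the first frame as a conceptual memory buffer that stores visual entities for reuse later in the generation process. Building on this insight, we show that robust and generalized video content customization can be achieved across diverse scenarios using only 20-50 training examples, without architectural changes or large-scale finetuning. This reveals a powerful yet overlooked capability of video generation models for reference-based video customization.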
