First Frame Is the Place to Go for Video Content Customization

Abstract

What role does the first frame play in video generation models? Traditionally, it is viewed as the spatial-temporal starting point of a video, merely a seed for subsequent animation. In this work, we reveal a fundamentally different perspective: video models implicitly treat the first frame as a conceptual memory buffer that stores visual entities for later reuse during generation. Leveraging this insight, we show that it is possible to achieve robust and generalized video content customization in diverse scenarios, using only 20-50 training examples without architectural changes or large-scale finetuning. This unveils a powerful, overlooked capability of video generation models for reference-based video customization.
