First Frame Is the Place to Go for Video Content Customization

Abstract

What role does the first frame play in video generation models? Traditionally, it is viewed as the spatio-temporal starting point of a video, merely a seed for subsequent animation. In this work, we reveal a fundamentally different perspective: video models implicitly treat the first frame as a conceptual memory buffer that stores visual entities for later reuse during generation. Leveraging this insight, we show that robust and generalizable video content customization can be achieved across diverse scenarios with only 20-50 training examples, without architectural changes or large-scale finetuning. This reveals a powerful, overlooked capability of video generation models for reference-based video customization.
