The First Frame: An Ideal Starting Point for Video Content Customization

Abstract

What role does the first frame play in video generation models? Traditionally, it is viewed as the spatiotemporal starting point of a video, merely a seed for the subsequent animation. In this work, we reveal a fundamentally different perspective: video models implicitly treat the first frame as a conceptual memory buffer that stores visual entities for later reuse during generation. Leveraging this insight, we show that robust and generalizable video content customization can be achieved across diverse scenarios using only 20 to 50 training examples, without architectural changes or large-scale fine-tuning. This unveils a powerful but long-overlooked capability of video generation models for reference-based video customization.