Video-As-Prompt: Unified Semantic Control for Video Generation
October 23, 2025
Authors: Yuxuan Bian, Xin Chen, Zenan Li, Tiancheng Zhi, Shen Sang, Linjie Luo, Qiang Xu
cs.AI
Abstract
Unified, generalizable semantic control in video generation remains a
critical open challenge. Existing methods either introduce artifacts by
enforcing inappropriate pixel-wise priors from structure-based controls, or
rely on non-generalizable, condition-specific finetuning or task-specific
architectures. We introduce Video-As-Prompt (VAP), a new paradigm that reframes
this problem as in-context generation. VAP leverages a reference video as a
direct semantic prompt, guiding a frozen Video Diffusion Transformer (DiT) via
a plug-and-play Mixture-of-Transformers (MoT) expert. This architecture
prevents catastrophic forgetting and is guided by a temporally biased position
embedding that eliminates spurious mapping priors for robust context retrieval.
To power this approach and catalyze future research, we built VAP-Data, the
largest dataset for semantic-controlled video generation with over 100K paired
videos across 100 semantic conditions. As a single unified model, VAP sets a
new state-of-the-art for open-source methods, achieving a 38.7% user preference
rate that rivals leading condition-specific commercial models. VAP's strong
zero-shot generalization and support for various downstream applications mark a
significant advance toward general-purpose, controllable video generation.
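
To make the in-context formulation concrete, here is a minimal sketch of the core idea in PyTorch. It is not the authors' code: all names (VAPSketch, time_pos, etc.) are hypothetical, a plain transformer encoder stands in for both the frozen video DiT and the Mixture-of-Transformers expert, and a simple temporal position offset stands in for the paper's temporally biased position embedding. The sketch only shows how reference-video tokens can act as a semantic prompt alongside noisy target tokens.

```python
import torch
import torch.nn as nn

class VAPSketch(nn.Module):
    """Hypothetical sketch of the Video-As-Prompt idea, not the authors' implementation."""

    def __init__(self, dim=512, heads=8, depth=2, max_frames=64):
        super().__init__()
        # Frozen backbone: stands in for the pretrained video DiT.
        backbone_layer = nn.TransformerEncoderLayer(dim, heads, batch_first=True)
        self.frozen_dit = nn.TransformerEncoder(backbone_layer, depth)
        for p in self.frozen_dit.parameters():
            p.requires_grad_(False)  # freezing the backbone is what prevents catastrophic forgetting
        # Trainable branch: a single encoder standing in for the plug-and-play MoT expert.
        expert_layer = nn.TransformerEncoderLayer(dim, heads, batch_first=True)
        self.expert = nn.TransformerEncoder(expert_layer, depth)
        # Shared table of temporal positions covering [reference | target] frames.
        self.time_pos = nn.Embedding(2 * max_frames, dim)

    def forward(self, ref_tokens, noisy_tokens):
        # ref_tokens:   (B, F_ref, dim) per-frame tokens of the reference video (the semantic prompt)
        # noisy_tokens: (B, F_tgt, dim) noisy target-video tokens at the current diffusion step
        f_ref, f_tgt = ref_tokens.shape[1], noisy_tokens.shape[1]
        device = ref_tokens.device
        # Temporal bias: the reference clip occupies earlier positions than the target
        # clip, so attention retrieves semantics from context rather than learning a
        # spurious frame-to-frame (pixel-aligned) mapping prior.
        ref_pos = self.time_pos(torch.arange(0, f_ref, device=device))
        tgt_pos = self.time_pos(torch.arange(f_ref, f_ref + f_tgt, device=device))
        prompt = self.expert(ref_tokens + ref_pos)            # encode the video prompt
        x = torch.cat([prompt, noisy_tokens + tgt_pos], dim=1)
        x = self.frozen_dit(x)                                # frozen DiT attends over [prompt | target]
        return x[:, f_ref:]                                   # predicted denoised target tokens

# Usage with random stand-in tokens:
model = VAPSketch()
ref = torch.randn(2, 16, 512)    # tokens of a 16-frame reference video
noisy = torch.randn(2, 16, 512)  # noisy tokens of the 16-frame target video
out = model(ref, noisy)          # -> (2, 16, 512)
```

The two design points the abstract emphasizes both appear here: only the expert branch trains while the backbone stays frozen, and the reference video is injected purely as context at offset temporal positions rather than through any pixel-aligned structural control.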