Video-As-Prompt: Unified Semantic Control for Video Generation

October 23, 2025
Authors: Yuxuan Bian, Xin Chen, Zenan Li, Tiancheng Zhi, Shen Sang, Linjie Luo, Qiang Xu
cs.AI

Abstract

Unified, generalizable semantic control in video generation remains a critical open challenge. Existing methods either introduce artifacts by enforcing inappropriate pixel-wise priors from structure-based controls, or rely on non-generalizable, condition-specific finetuning or task-specific architectures. We introduce Video-As-Prompt (VAP), a new paradigm that reframes this problem as in-context generation. VAP leverages a reference video as a direct semantic prompt, guiding a frozen Video Diffusion Transformer (DiT) via a plug-and-play Mixture-of-Transformers (MoT) expert. This architecture prevents catastrophic forgetting and is guided by a temporally biased position embedding that eliminates spurious mapping priors for robust context retrieval. To power this approach and catalyze future research, we built VAP-Data, the largest dataset for semantic-controlled video generation with over 100K paired videos across 100 semantic conditions. As a single unified model, VAP sets a new state-of-the-art for open-source methods, achieving a 38.7% user preference rate that rivals leading condition-specific commercial models. VAP's strong zero-shot generalization and support for various downstream applications mark a significant advance toward general-purpose, controllable video generation.
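
To make the abstract's architectural claims concrete, below is a minimal, hypothetical PyTorch sketch of the core idea: a frozen backbone (standing in for the video DiT) is steered by a small trainable expert branch that jointly attends over concatenated [reference | target] tokens, with reference frames given shifted temporal position indices so no frame-to-frame alignment prior is imposed. All class names (`ExpertBlock`, `VAPSketch`), shapes, and the additive form of the temporal bias are illustrative assumptions, not the paper's implementation; in particular, the Mixture-of-Transformers expert is collapsed here into a couple of generic transformer layers.

```python
# Minimal sketch of the Video-As-Prompt idea described in the abstract.
# NOT the authors' code: module names, shapes, and the exact form of the
# temporally biased position embedding are assumptions for illustration.
import torch
import torch.nn as nn


class ExpertBlock(nn.Module):
    """One trainable layer of the plug-and-play expert branch: joint
    self-attention over concatenated [reference | target] tokens."""

    def __init__(self, dim: int, heads: int = 8):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm2 = nn.LayerNorm(dim)
        self.mlp = nn.Sequential(
            nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim)
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        h = self.norm1(x)
        x = x + self.attn(h, h, h, need_weights=False)[0]
        return x + self.mlp(self.norm2(x))


class VAPSketch(nn.Module):
    """Frozen backbone + trainable expert branch (hypothetical simplification)."""

    def __init__(self, frozen_backbone: nn.Module, dim: int,
                 depth: int = 2, max_frames: int = 64):
        super().__init__()
        self.backbone = frozen_backbone
        for p in self.backbone.parameters():
            p.requires_grad_(False)  # frozen: prevents catastrophic forgetting
        self.experts = nn.ModuleList(ExpertBlock(dim) for _ in range(depth))
        # Temporally biased positions (assumed additive form): reference-frame
        # indices occupy [0, T) and target-frame indices are shifted past them,
        # so reference frame t is NOT positionally aligned with target frame t
        # and no spurious frame-wise mapping prior is imposed.
        self.time_emb = nn.Embedding(2 * max_frames, dim)

    def forward(self, ref_tokens, tgt_tokens, ref_frames, tgt_frames):
        # ref_frames / tgt_frames: (B, N) integer frame index of each token.
        ref = ref_tokens + self.time_emb(ref_frames)
        tgt = tgt_tokens + self.time_emb(tgt_frames + ref_frames.max() + 1)
        ctx = torch.cat([ref, tgt], dim=1)  # in-context: prompt video + target
        for block in self.experts:
            ctx = block(ctx)
        tgt_ctx = ctx[:, ref.shape[1]:]  # expert features for target tokens
        # Inject the expert's semantic guidance into the frozen backbone.
        return self.backbone(tgt_tokens + tgt_ctx)


# Toy usage with a stand-in "frozen DiT" and random token sequences.
dit_stub = nn.Sequential(nn.LayerNorm(64), nn.Linear(64, 64))
model = VAPSketch(dit_stub, dim=64)
B, n_ref, n_tgt = 2, 16, 16
ref_tok, tgt_tok = torch.randn(B, n_ref, 64), torch.randn(B, n_tgt, 64)
ref_f = torch.arange(n_ref).expand(B, n_ref)
tgt_f = torch.arange(n_tgt).expand(B, n_tgt)
out = model(ref_tok, tgt_tok, ref_f, tgt_f)  # -> (B, n_tgt, 64)
```

The toy mirrors the two design points the abstract names: because only the expert branch trains, the pretrained DiT's generative prior stays intact (no catastrophic forgetting), and the shifted frame indices force the model to retrieve semantics from the reference in context rather than copy it pixel-by-pixel. The actual MoT expert presumably interacts with the DiT far more tightly than this standalone two-layer branch.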