Prompt Relay: Inference-Time Temporal Control for Multi-Event Video Generation
April 11, 2026
Authors: Gordon Chen, Ziqi Huang, Ziwei Liu
cs.AI
Abstract
Video diffusion models have achieved remarkable progress in generating high-quality videos. However, these models struggle to represent the temporal succession of multiple events in real-world videos and lack explicit mechanisms to control when semantic concepts appear, how long they persist, and the order in which multiple events occur. Such control is especially important for movie-grade video synthesis, where coherent storytelling depends on precise timing, duration, and transitions between events. When using a single paragraph-style prompt to describe a sequence of complex events, models often exhibit semantic entanglement, where concepts intended for different moments in the video bleed into one another, resulting in poor text-video alignment. To address these limitations, we propose Prompt Relay, an inference-time, plug-and-play method to enable fine-grained temporal control in multi-event video generation, requiring no architectural modifications and no additional computational overhead. Prompt Relay introduces a penalty into the cross-attention mechanism, so that each temporal segment attends only to its assigned prompt, allowing the model to represent one semantic concept at a time and thereby improving temporal prompt alignment, reducing semantic interference, and enhancing visual quality.
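The core mechanism described above — a penalty in cross-attention so that each temporal segment attends only to its assigned prompt — can be sketched as follows. This is a minimal illustrative implementation, not the authors' code: the function name `segmented_cross_attention`, the segment-id arrays, and the penalty value are all assumptions for illustration.

```python
import numpy as np

def segmented_cross_attention(q, k, v, frame_segments, token_segments,
                              penalty=-1e9):
    """Cross-attention in which each video frame (query) may only attend
    to prompt tokens (keys) belonging to its assigned temporal segment.

    Illustrative sketch only; a real video diffusion model would apply
    this per attention head inside the denoising network.
    q: (n_frames, d) frame queries
    k, v: (n_tokens, d) prompt-token keys and values
    frame_segments: (n_frames,) segment id of each frame
    token_segments: (n_tokens,) segment id of each prompt token
    """
    d = q.shape[-1]
    scores = q @ k.T / np.sqrt(d)  # (n_frames, n_tokens)

    # Penalize attention from a frame to tokens of a different segment,
    # so each segment "relays" to its own prompt only.
    mismatch = frame_segments[:, None] != token_segments[None, :]
    scores = np.where(mismatch, scores + penalty, scores)

    # Softmax over prompt tokens (numerically stabilized).
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ v, weights

# Toy usage: 4 frames split into two temporal segments, each with its
# own single-token prompt. Frames in segment 0 should place ~zero
# attention on segment 1's token, and vice versa.
rng = np.random.default_rng(0)
q = rng.normal(size=(4, 8))
k = rng.normal(size=(2, 8))
v = rng.normal(size=(2, 8))
frame_segments = np.array([0, 0, 1, 1])
token_segments = np.array([0, 1])
out, w = segmented_cross_attention(q, k, v, frame_segments, token_segments)
```

Because the penalty is added only at inference inside the attention softmax, no weights are retrained and no extra layers are introduced, which matches the plug-and-play, no-overhead framing in the abstract.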