Prompt Relay: Inference-Time Temporal Control for Multi-Event Video Generation

Abstract

Video diffusion models have achieved remarkable progress in generating high-quality videos. However, these models struggle to represent the temporal succession of multiple events in real-world videos and lack explicit mechanisms to control when semantic concepts appear, how long they persist, and the order in which multiple events occur. Such control is especially important for movie-grade video synthesis, where coherent storytelling depends on precise timing, duration, and transitions between events. When using a single paragraph-style prompt to describe a sequence of complex events, models often exhibit semantic entanglement, where concepts intended for different moments in the video bleed into one another, resulting in poor text-video alignment. To address these limitations, we propose Prompt Relay, an inference-time, plug-and-play method to enable fine-grained temporal control in multi-event video generation, requiring no architectural modifications and no additional computational overhead. Prompt Relay introduces a penalty into the cross-attention mechanism, so that each temporal segment attends only to its assigned prompt, allowing the model to represent one semantic concept at a time and thereby improving temporal prompt alignment, reducing semantic interference, and enhancing visual quality.
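
The cross-attention penalty described above can be illustrated with a toy sketch. The abstract does not give the exact form of the penalty, so this example makes several assumptions: the per-segment prompts are concatenated into one token sequence, each video frame and each prompt token carries a segment index, and the penalty is a large negative additive bias on the pre-softmax attention scores with hard segment boundaries. The names `segment_penalty`, `penalized_attention`, `frame_segment`, and `token_segment` are hypothetical and not taken from the paper.

```python
import math

def segment_penalty(frame_segment, token_segment, penalty=1e4):
    """Build an additive bias matrix (assumed form of the penalty).

    Entry [i][j] is 0.0 when frame i and prompt token j belong to the same
    temporal segment, and -penalty otherwise.
    """
    return [[0.0 if fs == ts else -penalty for ts in token_segment]
            for fs in frame_segment]

def softmax(row):
    """Numerically stable softmax over one row of scores."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]

def penalized_attention(scores, bias):
    """Add the bias to raw attention scores, then normalize each row."""
    return [softmax([s + b for s, b in zip(score_row, bias_row)])
            for score_row, bias_row in zip(scores, bias)]

# Toy setup: 4 frames and 5 prompt tokens split across two events.
frame_segment = [0, 0, 1, 1]      # frames 0-1 show event 0, frames 2-3 show event 1
token_segment = [0, 0, 0, 1, 1]   # tokens 0-2 describe event 0, tokens 3-4 describe event 1

# Stand-in for query-key scores; a real model would compute these from Q and K.
scores = [[0.1 * (i + j) for j in range(5)] for i in range(4)]

bias = segment_penalty(frame_segment, token_segment)
weights = penalized_attention(scores, bias)

for i, row in enumerate(weights):
    print(i, [round(w, 3) for w in row])
```

In this sketch each frame's attention mass falls almost entirely on the tokens of its own segment, which is the behavior the abstract attributes to the penalty. In an actual video diffusion model the queries would be spatio-temporal latent tokens, so every spatial token would presumably inherit the segment index of its frame. A softer penalty near segment boundaries is another plausible design, but that is speculation beyond what the abstract states.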