In-Video Instructions: Visual Signals as Generative Control
November 24, 2025
Authors: Gongfan Fang, Xinyin Ma, Xinchao Wang
cs.AI
Abstract
Large-scale video generative models have recently demonstrated strong visual capabilities, enabling the prediction of future frames that adhere to the logical and physical cues in the current observation. In this work, we investigate whether such capabilities can be harnessed for controllable image-to-video generation by interpreting visual signals embedded within the frames as instructions, a paradigm we term In-Video Instruction. In contrast to prompt-based control, which provides textual descriptions that are inherently global and coarse, In-Video Instruction encodes user guidance directly into the visual domain through elements such as overlaid text, arrows, or trajectories. By assigning distinct instructions to different objects, this enables explicit, spatially aware, and unambiguous correspondences between visual subjects and their intended actions. Extensive experiments on three state-of-the-art generators, Veo 3.1, Kling 2.5, and Wan 2.2, show that video models can reliably interpret and execute such visually embedded instructions, particularly in complex multi-object scenarios.
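To make the paradigm concrete, the sketch below (not the authors' code) shows one way to render a per-object instruction, an overlaid text label plus an arrow, directly onto the conditioning frame before passing it to an image-to-video generator. The file name scene.png, the object coordinates, and the final generator call are hypothetical placeholders; only the Pillow drawing calls are standard.

```python
# Minimal sketch of the In-Video Instruction idea: encode user guidance as
# visual elements (text + arrow) on the first frame, instead of in the prompt.
import math
from PIL import Image, ImageDraw, ImageFont


def overlay_instruction(frame: Image.Image,
                        text: str,
                        anchor_xy: tuple[int, int],
                        target_xy: tuple[int, int]) -> Image.Image:
    """Draw a text label at anchor_xy and an arrow from anchor_xy to target_xy."""
    frame = frame.copy()
    draw = ImageDraw.Draw(frame)
    font = ImageFont.load_default()

    # Text label placed next to the object this instruction refers to.
    draw.text(anchor_xy, text, fill="red", font=font)

    # Arrow shaft pointing toward the intended motion target.
    draw.line([anchor_xy, target_xy], fill="red", width=4)

    # Arrow head: a small triangle oriented along the shaft direction.
    ax, ay = anchor_xy
    tx, ty = target_xy
    ang = math.atan2(ty - ay, tx - ax)
    head = 14
    left = (tx - head * math.cos(ang - math.pi / 6), ty - head * math.sin(ang - math.pi / 6))
    right = (tx - head * math.cos(ang + math.pi / 6), ty - head * math.sin(ang + math.pi / 6))
    draw.polygon([target_xy, left, right], fill="red")
    return frame


first_frame = Image.open("scene.png")  # hypothetical input image

# Distinct instructions for distinct objects, encoded visually rather than globally.
conditioned = overlay_instruction(first_frame, "jump", anchor_xy=(120, 80), target_xy=(120, 30))
conditioned = overlay_instruction(conditioned, "walk right", anchor_xy=(300, 200), target_xy=(420, 200))

conditioned.save("scene_with_instructions.png")
# video = i2v_model.generate(image=conditioned)  # hypothetical generator API
```

Because each instruction is drawn at a specific location, the spatial binding between an object and its intended action comes for free, which is what distinguishes this setup from a single global text prompt.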