In-Video Instructions: Visual Signals as Generative Control
November 24, 2025
Authors: Gongfan Fang, Xinyin Ma, Xinchao Wang
cs.AI
Abstract
Large-scale video generative models have recently demonstrated strong visual capabilities, enabling the prediction of future frames that adhere to the logical and physical cues in the current observation. In this work, we investigate whether such capabilities can be harnessed for controllable image-to-video generation by interpreting visual signals embedded within the frames as instructions, a paradigm we term In-Video Instruction. In contrast to prompt-based control, which provides textual descriptions that are inherently global and coarse, In-Video Instruction encodes user guidance directly into the visual domain through elements such as overlaid text, arrows, or trajectories. This enables explicit, spatially aware, and unambiguous correspondences between visual subjects and their intended actions by assigning distinct instructions to different objects. Extensive experiments on three state-of-the-art generators, including Veo 3.1, Kling 2.5, and Wan 2.2, show that video models can reliably interpret and execute such visually embedded instructions, particularly in complex multi-object scenarios.
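To make the paradigm concrete, the sketch below shows one way an annotated conditioning frame might be produced: a per-object instruction (a short text label plus a motion arrow) is rendered directly onto the first frame with Pillow before that frame is passed to an image-to-video generator. This is a minimal illustration, not the paper's actual pipeline; the `overlay_instruction` helper, the rendering style, and the commented-out generator call are assumptions.

```python
from PIL import Image, ImageDraw


def overlay_instruction(frame: Image.Image, text: str,
                        arrow_start: tuple, arrow_end: tuple) -> Image.Image:
    """Overlay a textual instruction and a motion arrow onto the conditioning frame.

    The annotated frame itself, rather than a separate text prompt, carries the
    per-object instruction that the video model is expected to interpret.
    """
    annotated = frame.copy()
    draw = ImageDraw.Draw(annotated)

    # Arrow shaft from the subject toward its intended direction of motion.
    draw.line([arrow_start, arrow_end], fill="red", width=6)

    # Simple triangular arrowhead at the end point.
    x0, y0 = arrow_start
    x1, y1 = arrow_end
    dx, dy = x1 - x0, y1 - y0
    norm = max((dx ** 2 + dy ** 2) ** 0.5, 1e-6)
    ux, uy = dx / norm, dy / norm
    head = [
        (x1, y1),
        (x1 - 20 * ux + 10 * uy, y1 - 20 * uy - 10 * ux),
        (x1 - 20 * ux - 10 * uy, y1 - 20 * uy + 10 * ux),
    ]
    draw.polygon(head, fill="red")

    # Place the instruction text next to the arrow so it is visually bound to one object.
    draw.text((x0, y0 - 30), text, fill="red")
    return annotated


# Usage: annotate the first frame, then condition a video generator on it.
first_frame = Image.open("scene.png")
annotated = overlay_instruction(first_frame, "walk to the door", (320, 400), (520, 400))
annotated.save("scene_annotated.png")
# video = image_to_video_model.generate(image=annotated)  # hypothetical generator call
```

Repeating this overlay step with different labels and arrows for each subject is what, under this paradigm, assigns distinct instructions to different objects within a single conditioning frame.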