MotionStream: Real-Time Video Generation with Interactive Motion Controls
November 3, 2025
Authors: Joonghyuk Shin, Zhengqi Li, Richard Zhang, Jun-Yan Zhu, Jaesik Park, Eli Shechtman, Xun Huang
cs.AI
Abstract
Current motion-conditioned video generation methods suffer from prohibitive
latency (minutes per video) and non-causal processing that prevents real-time
interaction. We present MotionStream, enabling sub-second latency with up to 29
FPS streaming generation on a single GPU. Our approach begins by augmenting a
text-to-video model with motion control, which generates high-quality videos
that adhere to the global text prompt and local motion guidance but cannot
perform inference on the fly. We therefore distill this bidirectional teacher
into a causal student through Self Forcing with Distribution Matching
Distillation, enabling real-time streaming inference. Several key challenges
arise when generating videos over long, potentially infinite time horizons:
(1) bridging the domain gap between training on finite-length videos and
extrapolating to infinite horizons, (2) sustaining high quality by preventing
error accumulation, and (3) maintaining fast inference without the growth in
computational cost caused by an ever-expanding context window. Central to our
approach is a carefully designed sliding-window causal attention combined
with attention sinks. By incorporating self-rollout with attention sinks and
KV cache rolling during training, we faithfully simulate inference-time
extrapolation with a fixed context window, enabling constant-speed generation
of arbitrarily long videos. Our models achieve state-of-the-art results in
motion following and video quality while being two orders of magnitude faster,
uniquely enabling infinite-length streaming. With MotionStream, users can paint
trajectories, control cameras, or transfer motion, and see results unfold in
real time, delivering a truly interactive experience.
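
To make the bounded-context mechanism described above concrete, here is a
minimal illustrative sketch in PyTorch of a rolling KV cache with attention
sinks. This is not the paper's actual implementation; the names
`RollingKVCache`, `num_sink`, `window`, and `attend`, along with all sizes,
are hypothetical assumptions chosen only to show the idea of keeping a few
permanent sink tokens plus a fixed-size sliding window of recent tokens.

```python
# Illustrative sketch: sliding-window causal attention with attention sinks
# and a rolling KV cache. All names and sizes here are hypothetical, not the
# paper's actual API.
import torch
import torch.nn.functional as F


class RollingKVCache:
    """Bounded key/value cache: the first `num_sink` tokens (attention sinks)
    are kept permanently, and only the most recent `window` tokens are kept
    after that, so per-step attention cost stays constant."""

    def __init__(self, num_sink: int, window: int):
        self.num_sink = num_sink
        self.window = window
        self.k = None  # (batch, heads, seq, head_dim)
        self.v = None

    def append(self, k_new: torch.Tensor, v_new: torch.Tensor):
        if self.k is None:
            self.k, self.v = k_new, v_new
        else:
            self.k = torch.cat([self.k, k_new], dim=2)
            self.v = torch.cat([self.v, v_new], dim=2)
        # Roll the cache once it exceeds sinks + window: keep the sink
        # tokens and the most recent `window` tokens, drop the middle.
        if self.k.shape[2] > self.num_sink + self.window:
            self.k = torch.cat(
                [self.k[:, :, : self.num_sink], self.k[:, :, -self.window :]],
                dim=2,
            )
            self.v = torch.cat(
                [self.v[:, :, : self.num_sink], self.v[:, :, -self.window :]],
                dim=2,
            )
        return self.k, self.v


def attend(q: torch.Tensor, cache: RollingKVCache,
           k_new: torch.Tensor, v_new: torch.Tensor) -> torch.Tensor:
    """One streaming step: the newest chunk's queries attend to the sinks
    plus the sliding window of recent tokens. (Causal masking within the
    new chunk is omitted here for brevity.)"""
    k, v = cache.append(k_new, v_new)
    return F.scaled_dot_product_attention(q, k, v)


# Toy usage: stream 100 chunks of 16 tokens each. The cache never exceeds
# 4 sink tokens + 256 window tokens, so every step costs the same.
cache = RollingKVCache(num_sink=4, window=256)
for _ in range(100):
    q = torch.randn(1, 8, 16, 64)
    k_new = torch.randn(1, 8, 16, 64)
    v_new = torch.randn(1, 8, 16, 64)
    out = attend(q, cache, k_new, v_new)
assert cache.k.shape[2] <= 4 + 256
```

Per the abstract, the same bounded cache is also used during self-rollout at
training time, so the causal student sees the same fixed-context statistics
it will encounter when extrapolating to arbitrarily long videos at inference.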