
MotionStream: Real-Time Video Generation with Interactive Motion Controls

November 3, 2025
Authors: Joonghyuk Shin, Zhengqi Li, Richard Zhang, Jun-Yan Zhu, Jaesik Park, Eli Shechtman, Xun Huang
cs.AI

Abstract

Current motion-conditioned video generation methods suffer from prohibitive latency (minutes per video) and non-causal processing that prevents real-time interaction. We present MotionStream, enabling sub-second latency with up to 29 FPS streaming generation on a single GPU. Our approach begins by augmenting a text-to-video model with motion control, which generates high-quality videos that adhere to the global text prompt and local motion guidance but cannot run inference on the fly. We therefore distill this bidirectional teacher into a causal student through Self Forcing with Distribution Matching Distillation, enabling real-time streaming inference. Several key challenges arise when generating videos over long, potentially infinite time horizons: (1) bridging the domain gap between training on finite-length videos and extrapolating to infinite horizons, (2) sustaining high quality by preventing error accumulation, and (3) maintaining fast inference without incurring growth in computational cost as the context window expands. A key to our approach is a carefully designed sliding-window causal attention combined with attention sinks. By incorporating self-rollout with attention sinks and KV cache rolling during training, we properly simulate inference-time extrapolation with a fixed context window, enabling constant-speed generation of arbitrarily long videos. Our models achieve state-of-the-art results in motion following and video quality while being two orders of magnitude faster, uniquely enabling infinite-length streaming. With MotionStream, users can paint trajectories, control cameras, or transfer motion, and see results unfold in real time, delivering a truly interactive experience.
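
The fixed-context streaming idea described in the abstract (a few "attention sink" tokens kept permanently, plus a rolling KV cache over the most recent context) can be illustrated with a small sketch. The snippet below is not the authors' implementation; the class and function names (RollingKVCache, attend), the chunk size, and the sink/window sizes are illustrative assumptions, and attention is block-causal in the sense that each new chunk attends to itself and to the cached past.

```python
# Minimal sketch, assuming per-chunk generation with a fixed-size KV cache.
import torch
import torch.nn.functional as F


class RollingKVCache:
    """Fixed-size KV cache: a few 'attention sink' tokens plus a sliding window."""

    def __init__(self, num_sink: int, window: int):
        self.num_sink = num_sink  # earliest tokens that are never evicted
        self.window = window      # number of most recent tokens to keep
        self.k = None             # cached keys,   shape (cache_len, d)
        self.v = None             # cached values, shape (cache_len, d)

    def append(self, k_new: torch.Tensor, v_new: torch.Tensor) -> None:
        # Add the new chunk's keys/values, then evict the oldest non-sink
        # entries so the cache never exceeds num_sink + window tokens.
        self.k = k_new if self.k is None else torch.cat([self.k, k_new], dim=0)
        self.v = v_new if self.v is None else torch.cat([self.v, v_new], dim=0)
        max_len = self.num_sink + self.window
        if self.k.shape[0] > max_len:
            self.k = torch.cat([self.k[: self.num_sink], self.k[-self.window:]], dim=0)
            self.v = torch.cat([self.v[: self.num_sink], self.v[-self.window:]], dim=0)


def attend(q_new: torch.Tensor, cache: RollingKVCache) -> torch.Tensor:
    # Block-causal step: the newest chunk's queries attend to everything
    # currently cached (sinks + recent window + the chunk itself).
    scores = (q_new @ cache.k.T) / cache.k.shape[-1] ** 0.5
    return F.softmax(scores, dim=-1) @ cache.v


if __name__ == "__main__":
    d, num_sink, window = 64, 4, 16
    cache = RollingKVCache(num_sink=num_sink, window=window)
    torch.manual_seed(0)
    for _ in range(1000):                  # arbitrarily long stream of chunks
        q, k, v = torch.randn(3, 2, d)     # one 2-token chunk per step
        cache.append(k, v)
        out = attend(q, cache)
    print(out.shape, cache.k.shape)        # cache length stays at num_sink + window
```

Because eviction keeps only the sink tokens and the most recent window, the cache length, and therefore the per-chunk attention cost, stays constant no matter how long the stream runs, which is what makes constant-speed generation of arbitrarily long videos possible.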