Stable Video Infinity: Infinite-Length Video Generation with Error Recycling
October 10, 2025
Authors: Wuyang Li, Wentao Pan, Po-Chien Luan, Yang Gao, Alexandre Alahi
cs.AI
Abstract
We propose Stable Video Infinity (SVI), which generates infinite-length
videos with high temporal consistency, plausible scene
transitions, and controllable streaming storylines. While existing long-video
methods attempt to mitigate accumulated errors via handcrafted anti-drifting
(e.g., modified noise scheduler, frame anchoring), they remain limited to
single-prompt extrapolation, producing homogeneous scenes with repetitive
motions. We identify that the fundamental challenge extends beyond error
accumulation to a critical discrepancy between the training assumption (seeing
clean data) and the test-time autoregressive reality (conditioning on
self-generated, error-prone outputs). To bridge this hypothesis gap, SVI
incorporates Error-Recycling Fine-Tuning, a new type of efficient training that
recycles the Diffusion Transformer (DiT)'s self-generated errors into
supervisory prompts, thereby encouraging DiT to actively identify and correct
its own errors. This is achieved by injecting, collecting, and banking errors
through closed-loop recycling, autoregressively learning from error-injected
feedback. Specifically, we (i) inject historical errors made by DiT to
intervene on clean inputs, simulating error-accumulated trajectories in flow
matching; (ii) efficiently approximate predictions with one-step bidirectional
integration and calculate errors with residuals; (iii) dynamically bank errors
into replay memory across discretized timesteps, from which errors are
resampled for new inputs. SVI scales videos from seconds to infinite durations with no
additional inference cost, while remaining compatible with diverse conditions
(e.g., audio, skeleton, and text streams). We evaluate SVI on three benchmarks,
including consistent, creative, and conditional settings, thoroughly verifying
its versatility and state-of-the-art performance.
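The closed-loop recycling in steps (i)-(iii) can be sketched as follows. This is a minimal toy illustration, not the paper's implementation: `ErrorBank`, `error_recycling_step`, the scalar signals, and the bucketing scheme are all illustrative assumptions standing in for the actual DiT latents and flow-matching machinery.

```python
import random
from collections import defaultdict

class ErrorBank:
    """Replay memory that banks residual errors per discretized timestep.

    A simplified stand-in for SVI's error bank: names, scalar signals, and
    the bucketing scheme are assumptions, not the paper's code.
    """

    def __init__(self, num_bins=10, capacity=100):
        self.num_bins = num_bins
        self.capacity = capacity
        self.bins = defaultdict(list)

    def _bin(self, t):
        # Discretize a continuous timestep t in [0, 1) into an integer bin.
        return min(int(t * self.num_bins), self.num_bins - 1)

    def bank(self, t, error):
        bucket = self.bins[self._bin(t)]
        bucket.append(error)
        if len(bucket) > self.capacity:  # drop the oldest entry when full
            bucket.pop(0)

    def sample(self, t):
        bucket = self.bins[self._bin(t)]
        # No banked error yet for this bin: fall back to zero (no injection).
        return random.choice(bucket) if bucket else 0.0


def error_recycling_step(clean_input, t, target, model, bank):
    # (i) inject a resampled historical error to corrupt the clean input,
    #     mimicking the error-accumulated states seen at test time
    corrupted = clean_input + bank.sample(t)
    # (ii) predict in one step; the residual against the target serves as
    #     the self-generated error
    error = model(corrupted, t) - target
    # (iii) bank the fresh error so future steps can resample it
    bank.bank(t, error)
    return error
```

A toy model such as `lambda x, t: 0.9 * x` makes the loop visible: the first step banks the residual on clean input, and later steps condition on inputs corrupted by previously banked errors, so the model is trained against its own drift rather than against clean data only.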