
Stable Video Infinity: Infinite-Length Video Generation with Error Recycling

October 10, 2025
作者: Wuyang Li, Wentao Pan, Po-Chien Luan, Yang Gao, Alexandre Alahi
cs.AI

Abstract

We propose Stable Video Infinity (SVI), which is able to generate infinite-length videos with high temporal consistency, plausible scene transitions, and controllable streaming storylines. While existing long-video methods attempt to mitigate accumulated errors via handcrafted anti-drifting strategies (e.g., modified noise schedulers, frame anchoring), they remain limited to single-prompt extrapolation, producing homogeneous scenes with repetitive motions. We identify that the fundamental challenge extends beyond error accumulation to a critical discrepancy between the training assumption (seeing clean data) and the test-time autoregressive reality (conditioning on self-generated, error-prone outputs). To bridge this assumption gap, SVI introduces Error-Recycling Fine-Tuning, a new, efficient training scheme that recycles the Diffusion Transformer (DiT)'s self-generated errors into supervisory prompts, thereby encouraging the DiT to actively identify and correct its own errors. This is achieved by injecting, collecting, and banking errors through closed-loop recycling, learning autoregressively from error-injected feedback. Specifically, we (i) inject historical errors made by the DiT to intervene on clean inputs, simulating error-accumulated trajectories in flow matching; (ii) efficiently approximate predictions with one-step bidirectional integration and compute errors from the residuals; (iii) dynamically bank errors into a replay memory across discretized timesteps, from which they are resampled for new inputs. SVI is able to scale videos from seconds to infinite durations with no additional inference cost, while remaining compatible with diverse conditions (e.g., audio, skeleton, and text streams). We evaluate SVI on three benchmarks covering consistent, creative, and conditional settings, thoroughly verifying its versatility and state-of-the-art performance.
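
The closed loop sketched in the abstract (inject banked errors into clean inputs, approximate predictions with one-step bidirectional integration, bank the residuals per discretized timestep, then resample them for new inputs) can be illustrated with a minimal training-step sketch. Everything below is an illustrative assumption rather than the authors' released code: the names `ErrorBank` and `error_recycling_step`, the toy time-conditioned network standing in for the DiT, the bin count, the injection scale, and the choice to supervise the velocity toward the clean data are all one plausible reading of the abstract.

```python
# Minimal sketch of error-recycling fine-tuning under a linear flow-matching
# interpolant x_t = (1 - t) * x0 + t * x1 with velocity target v = x1 - x0.
# Hypothetical names and hyperparameters; not the authors' implementation.
import random
from typing import Optional

import torch
import torch.nn as nn


class ErrorBank:
    """Replay memory that banks residual errors per discretized timestep bin."""

    def __init__(self, num_bins: int = 10, capacity_per_bin: int = 256):
        self.num_bins = num_bins
        self.capacity = capacity_per_bin
        self.bins = [[] for _ in range(num_bins)]

    def _bin(self, t: float) -> int:
        return min(int(t * self.num_bins), self.num_bins - 1)

    def push(self, t: float, err: torch.Tensor) -> None:
        b = self.bins[self._bin(t)]
        b.append(err.detach().cpu())
        if len(b) > self.capacity:
            b.pop(0)  # drop the oldest banked error when the bin is full

    def sample(self, t: float) -> Optional[torch.Tensor]:
        b = self.bins[self._bin(t)]
        return random.choice(b) if b else None


def error_recycling_step(model: nn.Module, bank: ErrorBank,
                         x0: torch.Tensor, inject_scale: float = 1.0):
    """One training step: inject a banked error, predict, and recycle the residual."""
    t = torch.rand(())                       # flow-matching time in [0, 1)
    x1 = torch.randn_like(x0)                # noise endpoint

    # (i) intervene on the clean input with a previously banked error, emulating
    # the error-accumulated state seen at autoregressive test time.
    err = bank.sample(float(t))
    x0_corrupt = x0 + inject_scale * err.to(x0) if err is not None else x0

    xt = (1.0 - t) * x0_corrupt + t * x1     # interpolant on the corrupted trajectory
    v_pred = model(xt, t)                    # DiT stand-in predicts the velocity

    # (ii) one-step bidirectional integration: jump to both endpoints at once,
    # then read the errors off the residuals against the clean targets.
    x0_hat = xt - t * v_pred                 # integrate backward toward data
    x1_hat = xt + (1.0 - t) * v_pred         # integrate forward toward noise
    err_data = (x0_hat - x0).detach()
    err_noise = (x1_hat - x1).detach()

    # (iii) bank the data-side residual so later steps can resample it.
    bank.push(float(t), err_data)

    # Supervise toward the clean trajectory so the model learns to correct
    # the injected error (one plausible reading of the abstract).
    loss = ((v_pred - (x1 - x0)) ** 2).mean()
    return loss, err_data, err_noise


if __name__ == "__main__":
    net = nn.Sequential(nn.Flatten(),
                        nn.Linear(4 * 8 * 8, 4 * 8 * 8),
                        nn.Unflatten(1, (4, 8, 8)))  # toy stand-in for the DiT

    class TimeConditioned(nn.Module):
        """Wraps the toy net with the (x, t) signature used above."""
        def __init__(self, net):
            super().__init__()
            self.net = net

        def forward(self, x, t):
            return self.net(x)  # the toy ignores t; a real DiT would embed it

    model, bank = TimeConditioned(net), ErrorBank()
    for _ in range(3):
        x0 = torch.randn(2, 4, 8, 8)         # fake clean video latents
        loss, _, _ = error_recycling_step(model, bank, x0)
        loss.backward()
```

In this reading, banking residuals per timestep bin mirrors item (iii) of the abstract, and the pair `x0_hat`/`x1_hat` is simply what a single bidirectional integration step of the predicted velocity yields under the linear interpolant, which keeps the error estimate cheap enough to compute on every training step.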