UltraViCo: Breaking Extrapolation Limits in Video Diffusion Transformers

November 25, 2025
作者: Min Zhao, Hongzhou Zhu, Yingze Wang, Bokai Yan, Jintao Zhang, Guande He, Ling Yang, Chongxuan Li, Jun Zhu
cs.AI

Abstract

Despite advances, video diffusion transformers still struggle to generalize beyond their training length, a challenge we term video length extrapolation. We identify two failure modes: model-specific periodic content repetition and a universal quality degradation. Prior works attempt to solve repetition via positional encodings, overlooking quality degradation and achieving only limited extrapolation. In this paper, we revisit the challenge from a more fundamental view: attention maps, which directly govern how context influences outputs. We find that both failure modes arise from a single cause: attention dispersion, where tokens beyond the training window dilute learned attention patterns. This dispersion leads to quality degradation; repetition emerges as a special case when the dispersion becomes structured into periodic attention patterns, induced by the harmonic properties of positional encodings. Building on this insight, we propose UltraViCo, a training-free, plug-and-play method that suppresses attention to tokens beyond the training window via a constant decay factor. By jointly addressing both failure modes, we outperform a broad set of baselines by a large margin across models and extrapolation ratios, pushing the extrapolation limit from 2x to 4x. Remarkably, our method improves Dynamic Degree and Imaging Quality by 233% and 40.5% over the previous best method at 4x extrapolation. Furthermore, it generalizes seamlessly to downstream tasks such as controllable video synthesis and editing.
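
The mechanism summarized in the abstract, suppressing attention to tokens beyond the training window by a constant decay factor, can be sketched as follows. This is a minimal illustration based only on the abstract: the function name decayed_attention, the default decay value, the choice to apply the factor after the softmax and then renormalize, and the treatment of the trailing tokens as the out-of-window ones are all assumptions, not the paper's exact formulation.

    import torch

    def decayed_attention(q, k, v, train_len, decay=0.5):
        # q, k, v: (batch, heads, seq_len, head_dim)
        # train_len: number of key tokens covered by the training window
        # decay: illustrative constant factor in (0, 1); not the paper's value
        scale = q.shape[-1] ** -0.5
        attn = torch.softmax(q @ k.transpose(-2, -1) * scale, dim=-1)

        seq_len = k.shape[-2]
        if seq_len > train_len:
            # Suppress attention mass flowing to out-of-window key tokens,
            # then renormalize each row so the weights still sum to one.
            factor = torch.ones(seq_len, device=q.device, dtype=attn.dtype)
            factor[train_len:] = decay
            attn = attn * factor
            attn = attn / attn.sum(dim=-1, keepdim=True)

        return attn @ v

In a real video diffusion transformer, a modification of this kind would be dropped into each self-attention layer at inference time, which is what makes the approach training-free and plug-and-play as described in the abstract.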