UltraViCo: Breaking Extrapolation Limits in Video Diffusion Transformers

November 25, 2025
作者: Min Zhao, Hongzhou Zhu, Yingze Wang, Bokai Yan, Jintao Zhang, Guande He, Ling Yang, Chongxuan Li, Jun Zhu
cs.AI

Abstract

Despite advances, video diffusion transformers still struggle to generalize beyond their training length, a challenge we term video length extrapolation. We identify two failure modes: model-specific periodic content repetition and a universal quality degradation. Prior works attempt to solve repetition via positional encodings, overlooking quality degradation and achieving only limited extrapolation. In this paper, we revisit this challenge from a more fundamental view: attention maps, which directly govern how context influences outputs. We identify that both failure modes arise from a unified cause: attention dispersion, where tokens beyond the training window dilute learned attention patterns. This dispersion leads to quality degradation; repetition emerges as a special case when it becomes structured into periodic attention patterns induced by the harmonic properties of positional encodings. Building on this insight, we propose UltraViCo, a training-free, plug-and-play method that suppresses attention for tokens beyond the training window via a constant decay factor. By jointly addressing both failure modes, we outperform a broad set of baselines by a large margin across models and extrapolation ratios, pushing the extrapolation limit from 2x to 4x. Remarkably, our method improves Dynamic Degree and Imaging Quality by 233% and 40.5% over the previous best method at 4x extrapolation. Furthermore, it generalizes seamlessly to downstream tasks such as controllable video synthesis and editing.
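The abstract does not spell out exactly where the constant decay factor enters the attention computation, so the snippet below is only a minimal sketch of the idea: it assumes the decay multiplies post-softmax attention weights for key tokens beyond the training window, followed by renormalization. The names `attention_with_decay`, `train_len`, and `decay` are hypothetical and not taken from the paper.

```python
import torch
import torch.nn.functional as F


def attention_with_decay(q, k, v, train_len, decay=0.5):
    """Self-attention that damps keys beyond the training window (sketch).

    q, k, v: tensors of shape (batch, heads, seq_len, dim), with tokens
    ordered along the temporal axis; train_len is the training-time length.
    """
    scores = torch.matmul(q, k.transpose(-2, -1)) / (q.shape[-1] ** 0.5)
    attn = F.softmax(scores, dim=-1)

    seq_len = k.shape[-2]
    if seq_len > train_len:
        # Suppress attention paid to key tokens beyond the training window
        # by a constant decay factor, then renormalize each query's weights.
        mask = torch.ones(seq_len, device=attn.device, dtype=attn.dtype)
        mask[train_len:] = decay  # hypothetical placement of the decay
        attn = attn * mask
        attn = attn / attn.sum(dim=-1, keepdim=True)

    return torch.matmul(attn, v)
```

Because the damping is a fixed rescaling of attention weights, it requires no retraining and can be dropped into an existing attention layer at inference time, which matches the training-free, plug-and-play framing in the abstract.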