Lightning Unified Video Editing via In-Context Sparse Attention
May 6, 2026
Authors: Shitong Shao, Zikai Zhou, Haopeng Li, Yingwei Song, Wenliang Zhong, Lichen Bai, Zeke Xie
cs.AI
Abstract
Video editing has evolved toward In-Context Learning (ICL) paradigms, yet the resulting quadratic attention costs create a critical computational bottleneck. In this work, we propose In-context Sparse Attention (ISA), the first near-lossless empirical sparse framework tailored for ICL video editing. Our design is grounded in two key insights: first, context tokens exhibit significantly lower saliency than source tokens; second, we theoretically prove and empirically validate that query sharpness correlates with approximation error. Motivated by these findings, ISA implements an efficient pre-selection strategy to prune redundant context, followed by a dynamic query grouping mechanism that routes high-error queries to full attention and low-error ones to a computationally efficient zeroth-order Taylor sparse attention. Furthermore, building on ISA, we develop LIVEditor, a novel lightning video editing model, together with a video-editing data pipeline that curates a high-quality dataset of 1.7M samples. Extensive experiments demonstrate that LIVEditor achieves a ~60% reduction in attention-module latency while surpassing state-of-the-art methods across EditVerseBench, IVE-Bench, and VIE-Bench, delivering near-lossless acceleration without compromising visual fidelity.
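The abstract only summarizes the routing mechanism, so below is a minimal, hypothetical PyTorch sketch of the dynamic query grouping it describes. Everything here is an illustrative assumption rather than the paper's implementation: the entropy-based sharpness proxy, the threshold `tau`, the function and parameter names, and the top-k surrogate standing in for the zeroth-order Taylor sparse attention; the context pre-selection step is omitted entirely.

```python
import torch

def dynamic_query_grouping(q, k, v, tau=2.0, keep_ratio=0.125):
    """Hypothetical sketch: route 'sharp' queries (high approximation
    error) to exact attention, the rest to a cheap sparse surrogate.
    Shapes: q is (Nq, d); k and v are (Nk, d).

    NOTE: for clarity this sketch scores sharpness from the full
    attention matrix, which a real implementation would avoid by
    using a cheaper proxy.
    """
    d = q.shape[-1]
    logits = q @ k.T / d ** 0.5                    # (Nq, Nk) attention logits
    probs = logits.softmax(dim=-1)

    # Sharpness proxy (assumption): row entropy. A peaked, low-entropy
    # row means the query attends to few keys, so a sparse
    # approximation is riskier for it.
    entropy = -(probs * probs.clamp_min(1e-9).log()).sum(dim=-1)  # (Nq,)
    sharp = entropy < tau                          # high-error query mask

    out = torch.empty_like(q)
    # Group 1: high-error (sharp) queries get exact full attention.
    out[sharp] = probs[sharp] @ v
    # Group 2: low-error queries use a top-k sparse surrogate, a
    # stand-in for the paper's zeroth-order Taylor sparse attention.
    if (~sharp).any():
        k_keep = max(1, int(k.shape[0] * keep_ratio))
        top_logits, top_idx = logits[~sharp].topk(k_keep, dim=-1)
        top_probs = top_logits.softmax(dim=-1)     # renormalize over kept keys
        out[~sharp] = torch.einsum("nk,nkd->nd", top_probs, v[top_idx])
    return out
```

The point of the mask-based split is that exact attention is spent only on the queries where a sparse approximation would be least accurate, which is how a scheme like this can keep overall quality near-lossless while most queries take the cheap path.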