Lightning Unified Video Editing via In-Context Sparse Attention
May 6, 2026
Authors: Shitong Shao, Zikai Zhou, Haopeng Li, Yingwei Song, Wenliang Zhong, Lichen Bai, Zeke Xie
cs.AI
Abstract
Video editing has evolved toward In-Context Learning (ICL) paradigms, yet the resulting quadratic attention costs create a critical computational bottleneck. In this work, we propose In-context Sparse Attention (ISA), the first near-lossless empirical sparse framework tailored for ICL video editing. Our design is grounded in two key insights: first, context tokens exhibit significantly lower saliency than source tokens; second, we theoretically prove and empirically validate that query sharpness correlates with approximation error. Motivated by these findings, ISA implements an efficient pre-selection strategy to prune redundant context, followed by a dynamic query grouping mechanism that routes high-error queries to full attention and low-error ones to a computationally efficient zeroth-order Taylor sparse attention. Furthermore, building on ISA, we develop LIVEditor, a novel lightning video editing model, together with a video-editing data pipeline that curates a high-quality dataset of 1.7M samples. Extensive experiments demonstrate that LIVEditor achieves a ~60% reduction in attention-module latency while surpassing state-of-the-art methods across EditVerseBench, IVE-Bench, and VIE-Bench, delivering near-lossless acceleration without compromising visual fidelity.
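The abstract only summarizes the routing mechanism, so below is a minimal, hypothetical PyTorch sketch of the dynamic query grouping it describes. Everything here is an illustrative assumption rather than the paper's implementation: the entropy-based sharpness proxy, the threshold `tau`, the function and parameter names, and the top-k surrogate standing in for the zeroth-order Taylor sparse attention; the context pre-selection step is omitted entirely.

```python
import torch

def dynamic_query_grouping(q, k, v, tau=2.0, keep_ratio=0.125):
    """Hypothetical sketch: route 'sharp' queries (high approximation
    error) to exact attention, the rest to a cheap sparse surrogate.
    Shapes: q is (Nq, d); k and v are (Nk, d).

    NOTE: for clarity this sketch scores sharpness from the full
    attention matrix, which a real implementation would avoid by
    using a cheaper proxy.
    """
    d = q.shape[-1]
    logits = q @ k.T / d ** 0.5                    # (Nq, Nk) attention logits
    probs = logits.softmax(dim=-1)

    # Sharpness proxy (assumption): row entropy. A peaked, low-entropy
    # row means the query attends to few keys, so a sparse
    # approximation is riskier for it.
    entropy = -(probs * probs.clamp_min(1e-9).log()).sum(dim=-1)  # (Nq,)
    sharp = entropy < tau                          # high-error query mask

    out = torch.empty_like(q)
    # Group 1: high-error (sharp) queries get exact full attention.
    out[sharp] = probs[sharp] @ v
    # Group 2: low-error queries use a top-k sparse surrogate, a
    # stand-in for the paper's zeroth-order Taylor sparse attention.
    if (~sharp).any():
        k_keep = max(1, int(k.shape[0] * keep_ratio))
        top_logits, top_idx = logits[~sharp].topk(k_keep, dim=-1)
        top_probs = top_logits.softmax(dim=-1)     # renormalize over kept keys
        out[~sharp] = torch.einsum("nk,nkd->nd", top_probs, v[top_idx])
    return out
```

The point of the mask-based split is that exact attention is spent only on the queries where a sparse approximation would be least accurate, which is how a scheme like this can keep overall quality near-lossless while most queries take the cheap path.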