Lightning Unified Video Editing via In-Context Sparse Attention
May 6, 2026
Authors: Shitong Shao, Zikai Zhou, Haopeng Li, Yingwei Song, Wenliang Zhong, Lichen Bai, Zeke Xie
cs.AI
Abstract
Video editing has evolved toward In-Context Learning (ICL) paradigms, yet the resulting quadratic attention costs create a critical computational bottleneck. In this work, we propose In-context Sparse Attention (ISA), the first near-lossless empirical sparse framework tailored for ICL video editing. Our design is grounded in two key insights: first, context tokens exhibit significantly lower saliency than source tokens; second, we theoretically prove and empirically validate that query sharpness correlates with approximation error. Motivated by these findings, ISA implements an efficient pre-selection strategy to prune redundant context, followed by a dynamic query-grouping mechanism that routes high-error queries to full attention and low-error ones to a computationally efficient zeroth-order Taylor sparse attention. Furthermore, we build \texttt{LIVEditor}, a novel lightning video editing model, via ISA and a proposed video-editing data pipeline that curates a 1.7M-sample high-quality dataset. Extensive experiments demonstrate that LIVEditor achieves a roughly 60% reduction in attention-module latency while surpassing state-of-the-art methods across EditVerseBench, IVE-Bench, and VIE-Bench, delivering near-lossless acceleration without compromising visual fidelity.
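The routing idea in the abstract can be sketched in a few lines of NumPy. This is a hypothetical toy illustration, not the paper's implementation: it uses the max softmax probability of each query row as a stand-in "sharpness" score, a simple top-k key mask as a stand-in for the zeroth-order Taylor sparse attention, and a quantile threshold to split queries into full-attention and sparse-attention groups. A real implementation would estimate sharpness cheaply rather than computing the full score matrix, which is where the latency savings come from.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def full_attention(q, k, v):
    # Standard scaled dot-product attention.
    d = q.shape[-1]
    return softmax(q @ k.T / np.sqrt(d)) @ v

def sparse_attention(q, k, v, keep=0.25):
    # Toy sparse attention: each query attends only to its top-scoring
    # `keep` fraction of keys (stand-in for the paper's zeroth-order
    # Taylor sparse attention).
    d = q.shape[-1]
    scores = q @ k.T / np.sqrt(d)
    kth = max(1, int(k.shape[0] * keep))
    thresh = np.partition(scores, -kth, axis=-1)[:, -kth][:, None]
    scores = np.where(scores >= thresh, scores, -np.inf)
    return softmax(scores) @ v

def grouped_attention(q, k, v, sharpness_quantile=0.8):
    # Sharpness proxy: peak softmax probability per query (assumed here;
    # the paper's exact sharpness measure may differ). High-sharpness
    # (high approximation error) queries get full attention; the rest
    # are routed to the cheap sparse path.
    d = q.shape[-1]
    probs = softmax(q @ k.T / np.sqrt(d))  # illustrative only: a real
    sharpness = probs.max(axis=-1)         # system avoids this full pass
    cut = np.quantile(sharpness, sharpness_quantile)
    high = sharpness >= cut
    out = np.empty((q.shape[0], v.shape[1]))
    if high.any():
        out[high] = full_attention(q[high], k, v)
    if (~high).any():
        out[~high] = sparse_attention(q[~high], k, v)
    return out

rng = np.random.default_rng(0)
q = rng.standard_normal((16, 8))
k = rng.standard_normal((32, 8))
v = rng.standard_normal((32, 8))
out = grouped_attention(q, k, v)
```

With `keep=1.0` the sparse path degenerates to full attention, which gives a quick sanity check that the masking logic is consistent.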