Lightning Unified Video Editing via In-Context Sparse Attention
May 6, 2026
Authors: Shitong Shao, Zikai Zhou, Haopeng Li, Yingwei Song, Wenliang Zhong, Lichen Bai, Zeke Xie
cs.AI
Abstract
Video editing has evolved toward In-Context Learning (ICL) paradigms, yet the resulting quadratic attention costs create a critical computational bottleneck. In this work, we propose In-context Sparse Attention (ISA), the first near-lossless empirical sparse framework tailored for ICL video editing. Our design is grounded in two key insights: first, context tokens exhibit significantly lower saliency than source tokens; second, we theoretically prove and empirically validate that query sharpness correlates with approximation error. Motivated by these findings, ISA implements an efficient pre-selection strategy to prune redundant context, followed by a dynamic query-grouping mechanism that routes high-error queries to full attention and low-error ones to a computationally efficient zeroth-order Taylor sparse attention. Furthermore, we build \texttt{LIVEditor}, a novel lightning video editing model, via ISA and a proposed video-editing data pipeline that curates a 1.7M-sample high-quality dataset. Extensive experiments demonstrate that LIVEditor achieves a roughly 60% reduction in attention-module latency while surpassing state-of-the-art methods across EditVerseBench, IVE-Bench, and VIE-Bench, delivering near-lossless acceleration without compromising visual fidelity.
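The routing idea in the abstract can be sketched in a few lines of NumPy. This is a hypothetical toy illustration, not the paper's implementation: it uses the max softmax probability of each query row as a stand-in "sharpness" score, a simple top-k key mask as a stand-in for the zeroth-order Taylor sparse attention, and a quantile threshold to split queries into full-attention and sparse-attention groups. A real implementation would estimate sharpness cheaply rather than computing the full score matrix, which is where the latency savings come from.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def full_attention(q, k, v):
    # Standard scaled dot-product attention.
    d = q.shape[-1]
    return softmax(q @ k.T / np.sqrt(d)) @ v

def sparse_attention(q, k, v, keep=0.25):
    # Toy sparse attention: each query attends only to its top-scoring
    # `keep` fraction of keys (stand-in for the paper's zeroth-order
    # Taylor sparse attention).
    d = q.shape[-1]
    scores = q @ k.T / np.sqrt(d)
    kth = max(1, int(k.shape[0] * keep))
    thresh = np.partition(scores, -kth, axis=-1)[:, -kth][:, None]
    scores = np.where(scores >= thresh, scores, -np.inf)
    return softmax(scores) @ v

def grouped_attention(q, k, v, sharpness_quantile=0.8):
    # Sharpness proxy: peak softmax probability per query (assumed here;
    # the paper's exact sharpness measure may differ). High-sharpness
    # (high approximation error) queries get full attention; the rest
    # are routed to the cheap sparse path.
    d = q.shape[-1]
    probs = softmax(q @ k.T / np.sqrt(d))  # illustrative only: a real
    sharpness = probs.max(axis=-1)         # system avoids this full pass
    cut = np.quantile(sharpness, sharpness_quantile)
    high = sharpness >= cut
    out = np.empty((q.shape[0], v.shape[1]))
    if high.any():
        out[high] = full_attention(q[high], k, v)
    if (~high).any():
        out[~high] = sparse_attention(q[~high], k, v)
    return out

rng = np.random.default_rng(0)
q = rng.standard_normal((16, 8))
k = rng.standard_normal((32, 8))
v = rng.standard_normal((32, 8))
out = grouped_attention(q, k, v)
```

With `keep=1.0` the sparse path degenerates to full attention, which gives a quick sanity check that the masking logic is consistent.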